Commit History

Add ability to pass 'name' argument to load_dataset
88089e8

chargoddard committed on

Support loading data files from a local directory
9bdd30c

utensil committed on

Merge branch 'main' into flash-optimum
fd2c981
unverified

winglian committed on

add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed
aac4b76

winglian committed on

address PR feedback
0c6f928

winglian committed on

add streaming dataset support for pretraining datasets
eea2731

winglian committed on

more gpt-neox long ctx fixes
ab5cd28

winglian committed on

more tweaks to do pre-training with bettertransformers
1210dc8

winglian committed on

experimental expansion of ctx len
488a67d

winglian committed on

Set to use cfg.seed or 42 for backward compat
2cfe9e9

Nanobit committed on

fix batch size calculation
5a631b3

winglian committed on

Fix security issue or ignore false positives
a1f9850

Nanobit committed on

Apply isort then black
37293dc

Nanobit committed on

Fix mypy typing
e9650d3

Nanobit committed on

Black formatting
b832a0a

Nanobit committed on

Refactor
4c0eddb

Nanobit committed on

Fix data.py lint
cb7cd34

Nanobit committed on

Lint and format
392dfd9

Nanobit committed on

new hf_use_auth_token setting so login to hf isn't required
1c33eb8

winglian committed on

update readme and add typehints
a4f1241

winglian committed on

fix merge conflict failure, black format
7b5e762

winglian committed on

another fix for shard and train split
2e56203

winglian committed on

shard fix
ac79360

winglian committed on

apply black formatting
ce34d64

winglian committed on

more qlora support
e8aacfb

winglian committed on

be able to use adam bnb 8bit and one cycle scheduler w fsdp
9493b1b

winglian committed on

Update src/axolotl/utils/data.py for spelling
98a6781
unverified

winglian and Nanobit committed on

make sure to use train split if loading from hf
607a4d3

winglian committed on

fix new dataset prompt tokenizers
0f74464

winglian committed on

pygmalion dataset prompts format, cached tokenized datasets should be hashed on the tokenizer too
2809f3f

winglian committed on

tokenization fixes
4ea9a66

winglian committed on

optionally be able to specify alpaca or chat style prompts
1d5ab84

winglian committed on

concise multiple choice and tldr summarize
1365073

winglian committed on

add alpaca multiple choice instruct dataset support
b46bc02

winglian committed on

move filter to before saving so it doesn't happen everytime, update runpod manual script
0d28df0

winglian committed on

whoops, gt vs lt
84c7bc4

winglian committed on

optimize dataloading to use cache, fix model token embedding sizes
aa3c3f9

winglian committed on

black formatting
2bc1a5b

winglian committed on

fix conditional so alpaca doesn't choke
a27d594

winglian committed on

Add CompletionPrompt type
cf68153

Nanobit committed on

Jeopardy bot! (#17)
a12fb0a
unverified

winglian committed on

fix dataset handling, support galactica
4a17a4c

winglian committed on

tweaks to data loading, 8 bit adam, accelerate and deepspeed
097d367

winglian committed on

shuffle and split dataset after save/load
4f2584f

winglian committed on

fix sharegpt handling from hf, don't worry about loading llama if using earlier transformers release
8d43785

winglian committed on

various bugfixes
94f5e41

winglian committed on

WIP large refactor to make finetune script a little more manageable (#3)
6045345
unverified

winglian committed on