Commit History

support custom field for completion from yml (#580)
f7a2263
unverified

winglian committed on

remove columns after tokenizing for pretraining (#571)
1157950
unverified

winglian committed on

Fix pretraining with iterable/streaming Dataset (#556)
2f586d1
unverified

Jan Philipp Harries committed on

workaround for md5 variations (#533)
0b4cf5b
unverified

winglian committed on

support for datasets with multiple names (#480)
5ac3392
unverified

winglian committed on

improve llama pad token handling (#475)
cb9797e
unverified

winglian committed on

support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
d2e7f27
unverified

winglian committed on

add utils.data.prepare_dataset
2e22404

tmm1 committed on

use context manager to run things on rank0 before others (#397)
fc2d6be
unverified

winglian committed on

Attention mask and position id fixes for packing (#285)
2bb0b78
unverified

winglian committed on

experimental llama 2 chat support (#296)
3392270
unverified

Jan Philipp Harries committed on

optimize the iteration when tokenizeing large datasets (#332)
fe28543
unverified

winglian committed on

Merge pull request #276 from theobjectivedad/logging_enhancement
6f16c45
unverified

winglian committed on

Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var
b1f4f7a

theobjectivedad committed on

Add ability to pass 'name' argument to load_dataset
88089e8

chargoddard committed on

Support loading data files from a local directory
9bdd30c

utensil committed on

Merge branch 'main' into flash-optimum
fd2c981
unverified

winglian committed on

add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed
aac4b76

winglian committed on

address PR feedback
0c6f928

winglian committed on

add streaming dataset support for pretraining datasets
eea2731

winglian committed on

more gpt-neox long ctx fixes
ab5cd28

winglian committed on

more tweaks to do pre-training with bettertransformers
1210dc8

winglian committed on

experimental expansion of ctx len
488a67d

winglian committed on

Set to use cfg.seed or 42 for backward compat
2cfe9e9

Nanobit committed on

fix batch size calculation
5a631b3

winglian committed on

Fix security issue or ignore false positives
a1f9850

Nanobit committed on

Apply isort then black
37293dc

Nanobit committed on

Fix mypy typing
e9650d3

Nanobit committed on

Black formatting
b832a0a

Nanobit committed on

Refactor
4c0eddb

Nanobit committed on

Fix data.py lint
cb7cd34

Nanobit committed on

Lint and format
392dfd9

Nanobit committed on

new hf_use_auth_token setting so login to hf isn't required
1c33eb8

winglian committed on

update readme and add typehints
a4f1241

winglian committed on

fix merge conflict failure, black format
7b5e762

winglian committed on

another fix for shard and train split
2e56203

winglian committed on

shard fix
ac79360

winglian committed on

apply black formatting
ce34d64

winglian committed on

more qlora support
e8aacfb

winglian committed on

be able to use adam bnb 8bit and one cycle scheduler w fsdp
9493b1b

winglian committed on

Update src/axolotl/utils/data.py for spelling
98a6781
unverified

winglian Nanobit committed on

make sure to use train split if loading from hf
607a4d3

winglian committed on

fix new dataset prompt tokenizers
0f74464

winglian committed on

pygmalion dataset prompts format, cached tokenized datasets should be hashed on the tokenizer too
2809f3f

winglian committed on

tokenization fixes
4ea9a66

winglian committed on

optionally be able to specify alpaca or chat style prompts
1d5ab84

winglian committed on

concise multiple choice and tldr summarize
1365073

winglian committed on

add alpaca multiple choice instruct dataset support
b46bc02

winglian committed on

move filter to before saving so it doesn't happen everytime, update runpod manual script
0d28df0

winglian committed on