Commit History

Support user-defined prompt processing strategies for dpo (#1248)
1e3d530
unverified

nopperl winglian committed on

allow remote data paths (#1278)
91cf4ee
unverified

hamel committed on

Pretrain transforms (#1261)
c7cf381
unverified

winglian committed on

relora: magnitude pruning of the optimizer (#1245)
8c2e05a
unverified

winglian committed on

support for true batches with multipack (#1230)
00568c1
unverified

winglian committed on

Fix and document test_datasets (#1228)
5787e1a
unverified

DreamGenX winglian committed on

make sure to register the base chatml template even if no system message is provided (#1207)
badda37
unverified

winglian committed on

more dpo fixes for dataset loading and docs (#1185) [skip ci]
5bce45f
unverified

winglian committed on

Phi2 multipack (#1173)
814aee6
unverified

winglian committed on

support for explicit test_dataset definition for evals (#786)
cda52dc
unverified

winglian committed on

Vram fix attempt (#1164) [skip ci]
32580c1
unverified

winglian committed on

Deprecate max packed sequence len (#1141)
2ce5c0d
unverified

winglian committed on

feat(dataset): add config to keep processed dataset in memory (#1152)
3db5f2f
unverified

Nanobit committed on

fix(preprocess): Make sure dataset not loaded from cache when using preprocess cli (#1136)
1e56b88
unverified

Nanobit committed on

Preprocess dataset size fix (#1131)
7570446
unverified

winglian committed on

Efficiently get the length of the tokenized docs (#1063)
81d3845
unverified

ricdomolm winglian committed on

streaming multipack for pretraining dataset (#959)
553c80f
unverified

jinwonkim93 winglian committed on

fix: revert local dir dataset load (#878)
575a082
unverified

Nanobit committed on

don't train if eval split is too small (#873)
797f3dd
unverified

winglian committed on

Feat: Add dataset loading from S3, GCS (#765)
3cc67d2
unverified

Nanobit committed on

Update data.py for signature generation (#851)
48630f5
unverified

MilesQLi winglian committed on

cleanup the old multipack dataloader (#841)
1a6309c
unverified

winglian committed on

multipack w batch sampler (#795)
641e6f7
unverified

winglian committed on

update table for rwkv4 support, fix process count for dataset (#822)
cdc71f7
unverified

winglian committed on

Create preprocess CLI (#785)
e50ab07
unverified

casperhansen committed on

catch ConnectionError when checking dataset from HuggingFace (#743)
992d57f
unverified

Napuh committed on

improve handling of the prepared ds path and other cfg defaults (#701)
1c412c7
unverified

winglian committed on

Fix: Future deprecation warning with use_auth_token (#680)
69fac9a
unverified

Nanobit committed on

prepared dataset caching, other misc fixes (#665)
e50a64e
unverified

winglian committed on

add support for defined train split (#654)
409ca0f
unverified

winglian committed on

Fix bug in dataset loading (#284)
8fe0e63
unverified

ethanhs committed on

use fastchat conversations template (#578)
e7d3e2d
unverified

winglian committed on

attention_mask not needed for training (#642)
e8cbf50
unverified

winglian committed on

Feat(data): Allow loading local csv and text (#594)
00dce35
unverified

Nanobit committed on

support custom field for completion from yml (#580)
f7a2263
unverified

winglian committed on

remove columns after tokenizing for pretraining (#571)
1157950
unverified

winglian committed on

Fix pretraining with iterable/streaming Dataset (#556)
2f586d1
unverified

Jan Philipp Harries committed on

workaround for md5 variations (#533)
0b4cf5b
unverified

winglian committed on

support for datasets with multiple names (#480)
5ac3392
unverified

winglian committed on

improve llama pad token handling (#475)
cb9797e
unverified

winglian committed on

support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
d2e7f27
unverified

winglian committed on

add utils.data.prepare_dataset
2e22404

tmm1 committed on

use context manager to run things on rank0 before others (#397)
fc2d6be
unverified

winglian committed on

Attention mask and position id fixes for packing (#285)
2bb0b78
unverified

winglian committed on

experimental llama 2 chat support (#296)
3392270
unverified

Jan Philipp Harries committed on

optimize the iteration when tokenizing large datasets (#332)
fe28543
unverified

winglian committed on

Merge pull request #276 from theobjectivedad/logging_enhancement
6f16c45
unverified

winglian committed on

Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var
b1f4f7a

theobjectivedad committed on