Commit History
allow remote data paths (#1278) - 91cf4ee (hamel)
Pretrain transforms (#1261) - c7cf381 (winglian)
relora: magnitude pruning of the optimizer (#1245) - 8c2e05a (winglian)
support for true batches with multipack (#1230) - 00568c1 (winglian)
Fix and document test_datasets (#1228) - 5787e1a
make sure to register the base chatml template even if no system message is provided (#1207) - badda37 (winglian)
more dpo fixes for dataset loading and docs (#1185) [skip ci] - 5bce45f (winglian)
Phi2 multipack (#1173) - 814aee6 (winglian)
Add desc to map/filter (#1162) - 6840381
support for explicit test_dataset definition for evals (#786) - cda52dc (winglian)
Vram fix attempt (#1164) [skip ci] - 32580c1 (winglian)
Deprecate max packed sequence len (#1141) - 2ce5c0d (winglian)
feat(dataset): add config to keep processed dataset in memory (#1152) - 3db5f2f (Nanobit)
fix(preprocess): Make sure dataset not loaded from cache when using preprocess cli (#1136) - 1e56b88 (Nanobit)
Preprocess dataset size fix (#1131) - 7570446 (winglian)
streaming multipack for pretraining dataset (#959) - 553c80f
fix: revert local dir dataset load (#878) - 575a082 (Nanobit)
don't train if eval split is too small (#873) - 797f3dd (winglian)
Feat: Add dataset loading from S3, GCS (#765) - 3cc67d2 (Nanobit)
cleanup the old multipack dataloader (#841) - 1a6309c (winglian)
multipack w batch sampler (#795) - 641e6f7 (winglian)
update table for rwkv4 support, fix process count for dataset (#822) - cdc71f7 (winglian)
Create preprocess CLI (#785) - e50ab07 (casperhansen)
catch ConnectionError when checking dataset from HuggingFace (#743) - 992d57f (Napuh)
improve handling of the prepared ds path and other cfg defaults (#701) - 1c412c7 (winglian)
Fix: Future deprecation warning with use_auth_token (#680) - 69fac9a (Nanobit)
prepared dataset caching, other misc fixes (#665) - e50a64e (winglian)
add support for defined train split (#654) - 409ca0f (winglian)
Fix bug in dataset loading (#284) - 8fe0e63 (ethanhs)
use fastchat conversations template (#578) - e7d3e2d (winglian)
attention_mask not needed for training (#642) - e8cbf50 (winglian)
Feat(data): Allow loading local csv and text (#594) - 00dce35 (Nanobit)
support custom field for completion from yml (#580) - f7a2263 (winglian)
remove columns after tokenizing for pretraining (#571) - 1157950 (winglian)
Fix pretraining with iterable/streaming Dataset (#556) - 2f586d1 (Jan Philipp Harries)
workaround for md5 variations (#533) - 0b4cf5b (winglian)
support for datasets with multiple names (#480) - 5ac3392 (winglian)
improve llama pad token handling (#475) - cb9797e (winglian)
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348) - d2e7f27 (winglian)
add utils.data.prepare_dataset - 2e22404 (tmm1)
use context manager to run things on rank0 before others (#397) - fc2d6be (winglian)
Attention mask and position id fixes for packing (#285) - 2bb0b78 (winglian)
experimental llama 2 chat support (#296) - 3392270 (Jan Philipp Harries)
optimize the iteration when tokenizeing large datasets (#332) - fe28543 (winglian)
Merge pull request #276 from theobjectivedad/logging_enhancement - 6f16c45 (winglian)
Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var - b1f4f7a (theobjectivedad)