Commits · Dovakiins/qwerrwe

Support user-defined prompt processing strategies for dpo (#1248)

1e3d530
unverified

nopperl

winglian committed on Feb 26

allow remote data paths (#1278)

91cf4ee
unverified

hamel committed on Feb 8

Pretrain transforms (#1261)

c7cf381
unverified

winglian committed on Feb 6

relora: magnitude pruning of the optimizer (#1245)

8c2e05a
unverified

winglian committed on Feb 6

support for true batches with multipack (#1230)

00568c1
unverified

winglian committed on Feb 1

Fix and document test_datasets (#1228)

5787e1a
unverified

DreamGenX

winglian committed on Jan 31

make sure to register the base chatml template even if no system message is provided (#1207)

badda37
unverified

winglian committed on Jan 25

more dpo fixes for dataset loading and docs (#1185) [skip ci]

5bce45f
unverified

winglian committed on Jan 24

Phi2 multipack (#1173)

814aee6
unverified

winglian committed on Jan 23

DPO cleanup (#1126)

7523d1f
unverified

winglian

plaguss (HF staff) committed on Jan 23

Add desc to map/filter (#1162)

6840381
unverified

casperhansen

winglian committed on Jan 23

support for explicit test_dataset definition for evals (#786)

cda52dc
unverified

winglian committed on Jan 23

Vram fix attempt (#1164) [skip ci]

32580c1
unverified

winglian committed on Jan 23

Deprecate max packed sequence len (#1141)

2ce5c0d
unverified

winglian committed on Jan 20

feat(dataset): add config to keep processed dataset in memory (#1152)

3db5f2f
unverified

Nanobit committed on Jan 20

fix(preprocess): Make sure dataset not loaded from cache when using preprocess cli (#1136)

1e56b88
unverified

Nanobit committed on Jan 17

Preprocess dataset size fix (#1131)

7570446
unverified

winglian committed on Jan 17

Efficiently get the length of the tokenized docs (#1063)

81d3845
unverified

ricdomolm

winglian committed on Jan 8

streaming multipack for pretraining dataset (#959)

553c80f
unverified

jinwonkim93

winglian committed on Jan 6

fix: revert local dir dataset load (#878)

575a082
unverified

Nanobit committed on Nov 18, 2023

don't train if eval split is too small (#873)

797f3dd
unverified

winglian committed on Nov 16, 2023

Feat: Add dataset loading from S3, GCS (#765)

3cc67d2
unverified

Nanobit committed on Nov 16, 2023

Update data.py for signature generation (#851)

48630f5
unverified

MilesQLi

winglian committed on Nov 15, 2023

cleanup the old multipack dataloader (#841)

1a6309c
unverified

winglian committed on Nov 12, 2023

multipack w batch sampler (#795)

641e6f7
unverified

winglian committed on Nov 8, 2023

update table for rwkv4 support, fix process count for dataset (#822)

cdc71f7
unverified

winglian committed on Nov 5, 2023

Create preprocess CLI (#785)

e50ab07
unverified

casperhansen committed on Oct 26, 2023

catch ConnectionError when checking dataset from HuggingFace (#743)

992d57f
unverified

Napuh committed on Oct 19, 2023

improve handling of the prepared ds path and other cfg defaults (#701)

1c412c7
unverified

winglian committed on Oct 13, 2023

Fix: Future deprecation warning with use_auth_token (#680)

69fac9a
unverified

Nanobit committed on Oct 5, 2023

prepared dataset caching, other misc fixes (#665)

e50a64e
unverified

winglian committed on Oct 3, 2023

add support for defined train split (#654)

409ca0f
unverified

winglian committed on Sep 29, 2023

Fix bug in dataset loading (#284)

8fe0e63
unverified

ethanhs committed on Sep 27, 2023

use fastchat conversations template (#578)

e7d3e2d
unverified

winglian committed on Sep 27, 2023

attention_mask not needed for training (#642)

e8cbf50
unverified

winglian committed on Sep 27, 2023

Feat(data): Allow loading local csv and text (#594)

00dce35
unverified

Nanobit committed on Sep 17, 2023

support custom field for completion from yml (#580)

f7a2263
unverified

winglian committed on Sep 15, 2023

remove columns after tokenizing for pretraining (#571)

1157950
unverified

winglian committed on Sep 14, 2023

Fix pretraining with iterable/streaming Dataset (#556)

2f586d1
unverified

Jan Philipp Harries committed on Sep 13, 2023

workaround for md5 variations (#533)

0b4cf5b
unverified

winglian committed on Sep 8, 2023

support for datasets with multiple names (#480)

5ac3392
unverified

winglian committed on Aug 29, 2023

improve llama pad token handling (#475)

cb9797e
unverified

winglian committed on Aug 24, 2023

support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)

d2e7f27
unverified

winglian committed on Aug 20, 2023

add utils.data.prepare_dataset

2e22404

tmm1 committed on Aug 15, 2023

use context manager to run things on rank0 before others (#397)

fc2d6be
unverified

winglian committed on Aug 15, 2023

Attention mask and position id fixes for packing (#285)

2bb0b78
unverified

winglian committed on Aug 12, 2023

experimental llama 2 chat support (#296)

3392270
unverified

Jan Philipp Harries committed on Aug 6, 2023

optimize the iteration when tokenizing large datasets (#332)

fe28543
unverified

winglian committed on Aug 4, 2023

Merge pull request #276 from theobjectivedad/logging_enhancement

6f16c45
unverified

winglian committed on Jul 16, 2023

Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var

b1f4f7a

theobjectivedad committed on Jul 15, 2023
