Training script for cosmo-1b?
#6
by vdmbrsv - opened
Is there training code for cosmo-1b?
We used an internal wrapper around the nanotron library (https://github.com/huggingface/nanotron/); you can adapt this config: https://github.com/loubnabnl/nanotron-smol-cluster/blob/main/brrr/cosmopedia/cosmo_1b.yaml
Hi @loubnabnl, thanks for pointing to the yaml file. I have two questions regarding the data preprocessing part.
- Cosmopedia data was in `prompt`-`text` format. For pretraining, do you simply concatenate prompt and text together to form a document?
- I noticed the datasets in the yaml file have different folder names: `tokenized_text_document`, `tokenized_completion_document`, `tokenized_train_prompt_document`, `tokenized_script_document`. Does this mean different data preparation methods were used for different subsets?
Thanks a lot!
- We only train on the `text` column; the prompts are not used.
- No, we didn't do any post-processing. This is only because the target columns had different names at the time, but they were all renamed to `text` in cosmopedia.
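For reference, here is a minimal sketch of what "training only on the `text` column" usually looks like in pretraining pipelines: documents are tokenized, concatenated with an EOS separator, and split into fixed-length chunks, while the `prompt` column is ignored. This is not the actual nanotron preprocessing code; the `tokenize` function below is a hypothetical stand-in for a real tokenizer, and the sequence length is arbitrary.

```python
# Sketch of standard pretraining document packing (assumption: this
# mirrors the usual HF-style pipeline, not the exact nanotron internals).

EOS = 0       # hypothetical EOS token id
SEQ_LEN = 8   # toy sequence length; real runs use e.g. 2048

def tokenize(text):
    # Stand-in tokenizer: maps each character to an integer id.
    return [ord(c) for c in text]

def pack(examples, seq_len=SEQ_LEN):
    """Concatenate tokenized `text` fields (EOS-separated) and
    split the stream into fixed-length chunks."""
    stream = []
    for ex in examples:
        # Only the `text` column is used; `prompt` is ignored.
        stream.extend(tokenize(ex["text"]) + [EOS])
    # Drop the trailing partial chunk, as is common in pretraining.
    n = (len(stream) // seq_len) * seq_len
    return [stream[i:i + seq_len] for i in range(0, n, seq_len)]

examples = [
    {"prompt": "Write a story...", "text": "Once upon"},
    {"prompt": "Explain atoms...", "text": "Atoms are"},
]
chunks = pack(examples)  # every chunk has exactly SEQ_LEN tokens
```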
Thanks @loubnabnl