Training script for cosmo-1b?
#6
by vdmbrsv - opened
Is there training code for cosmo-1b?
We used an internal wrapper around the nanotron library (https://github.com/huggingface/nanotron/); you can adapt this config: https://github.com/loubnabnl/nanotron-smol-cluster/blob/main/brrr/cosmopedia/cosmo_1b.yaml
Hi @loubnabnl, thanks for pointing to the yaml file. I have two questions regarding the data preprocessing part.
- Cosmopedia data was in `prompt`-`text` format. For pretraining, do you simply concatenate prompt and text together to form a document?
- I noticed the datasets in the yaml file have different folder names: `tokenized_text_document`, `tokenized_completion_document`, `tokenized_train_prompt_document`, `tokenized_script_document`. Does this mean different data preparation methods were used for different subsets?
Thanks a lot!
- We only train on the `text` column; the prompts are not used.
- No, we didn't do any post-processing. This is only because the target columns had different names at the time, but they were all renamed to `text` in cosmopedia.
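For reference, here is a minimal sketch of what "training only on the `text` column" usually looks like in pretraining pipelines: documents are tokenized, concatenated with an EOS separator, and split into fixed-length chunks, while the `prompt` column is ignored. This is not the actual nanotron preprocessing code; the `tokenize` function below is a hypothetical stand-in for a real tokenizer, and the sequence length is arbitrary.

```python
# Sketch of standard pretraining document packing (assumption: this
# mirrors the usual HF-style pipeline, not the exact nanotron internals).

EOS = 0       # hypothetical EOS token id
SEQ_LEN = 8   # toy sequence length; real runs use e.g. 2048

def tokenize(text):
    # Stand-in tokenizer: maps each character to an integer id.
    return [ord(c) for c in text]

def pack(examples, seq_len=SEQ_LEN):
    """Concatenate tokenized `text` fields (EOS-separated) and
    split the stream into fixed-length chunks."""
    stream = []
    for ex in examples:
        # Only the `text` column is used; `prompt` is ignored.
        stream.extend(tokenize(ex["text"]) + [EOS])
    # Drop the trailing partial chunk, as is common in pretraining.
    n = (len(stream) // seq_len) * seq_len
    return [stream[i:i + seq_len] for i in range(0, n, seq_len)]

examples = [
    {"prompt": "Write a story...", "text": "Once upon"},
    {"prompt": "Explain atoms...", "text": "Atoms are"},
]
chunks = pack(examples)  # every chunk has exactly SEQ_LEN tokens
```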
Thanks @loubnabnl