Training script for cosmo-1b?

#6
by vdmbrsv - opened

Is there training code for cosmo-1b?

Hugging Face TB Research org
•
edited Mar 25

Hi @loubnabnl, thanks for pointing to the YAML file. I have two questions about the data preprocessing.

  1. The Cosmopedia data is in prompt-text format. For pretraining, do you simply concatenate the prompt and the text to form a document?
  2. I noticed that the datasets in the YAML file have different folder names: tokenized_text_document, tokenized_completion_document, tokenized_train_prompt_document, and tokenized_script_document. Does this mean that different data preparation methods were used for different subsets?

Thanks a lot!

Hugging Face TB Research org
  • We only train on the text column; the prompts are not used.
  • No, we didn't do any post-processing. The folder names differ only because the target columns had different names at the time; they have all since been renamed to text in cosmopedia.
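For anyone reproducing this, here is a minimal sketch of what "train on the text column only" could look like when preparing documents. It is not the actual cosmo-1b preprocessing code: the helper name `pretraining_documents` and the sample rows are hypothetical, and the only things taken from the reply above are that the column is called text and that the prompt is dropped.

```python
from typing import Iterable, Iterator, Mapping


def pretraining_documents(
    rows: Iterable[Mapping[str, str]], text_column: str = "text"
) -> Iterator[str]:
    """Yield one pretraining document per row, taken from the text column only.

    Hypothetical helper: the prompt column is never read, so it cannot
    leak into the training documents.
    """
    for row in rows:
        document = row.get(text_column, "")
        if document:  # skip rows with a missing or empty text field
            yield document


# Made-up rows in prompt/text form, for illustration only.
rows = [
    {"prompt": "Write a short story about a fox.", "text": "A fox lived by the river."},
    {"prompt": "Explain photosynthesis.", "text": "Plants turn light into sugar."},
]

docs = list(pretraining_documents(rows))
print(docs)
```

Each yielded string would then go to the tokenizer as its own document, with no concatenation of prompt and text.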
