What were the training datasets for this model?

#14
by Weyaxi - opened

Hi,

Thanks for this great model!

The datasets for other SmolLM (v1) models have been shared, but I couldn’t find the datasets used for training this model. Will the names of these datasets be released?

Hugging Face TB Research org

Hi, we will share the training details in an upcoming tech report. We use a mix of FineWeb-Edu, DCLM, and The Stack, together with new math and code datasets that we will release alongside the tech report.

Hugging Face TB Research org

We released the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smoltalk

Thanks for the contributions and the release!

Weyaxi changed discussion status to closed
