What were the training datasets for this model?
#14 by Weyaxi - opened
Hi,
Thanks for this great model!
The datasets for other SmolLM (v1) models have been shared, but I couldn’t find the datasets used for training this model. Will the names of these datasets be released?
Hi, we will share the training details in an upcoming tech report. We use a mix of FineWeb-Edu, DCLM, and The Stack, along with new math and code datasets that we will release with the tech report.
We released the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
Thanks for the contributions and the release!
Weyaxi changed discussion status to closed