---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---
# cosmo2-tokenizer

Tokenizer used to train cosmo2. It was trained on 1M samples drawn from the following mixture:
- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- Stack Overflow 2%
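
To make the mixture concrete, the sketch below converts the listed percentages into approximate per-source sample counts. It assumes the percentages refer to shares of the 1M samples (rather than tokens or bytes), which this card does not state explicitly; the names `MIXTURE_PERCENT`, `TOTAL_SAMPLES`, and `samples_per_source` are illustrative and not part of any library.

```python
# Mixture weights as listed above.
# Assumption: each percentage is a share of the 1M training samples.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "Stack Overflow": 2,
}
TOTAL_SAMPLES = 1_000_000


def samples_per_source(mixture, total):
    """Convert integer percentages into per-source sample counts."""
    if sum(mixture.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture.items()}


counts = samples_per_source(MIXTURE_PERCENT, TOTAL_SAMPLES)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under that assumption, the counts sum back to the full 1M samples, with FineWeb-Edu contributing the large majority.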