---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---
# cosmo2-tokenizer
Tokenizer used for training cosmo2. It was trained on 1M samples drawn from the following mixture:
- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- StackOverFlow 2%
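
As a rough illustration of the mixture above, the sketch below converts the percentages into approximate per-source sample counts. It assumes that "1M" means exactly 1,000,000 samples and that the percentages are shares of the sample count (rather than, say, tokens); neither is stated in this card. The names `MIXTURE_PERCENT`, `TOTAL_SAMPLES`, and `samples_per_source` are hypothetical and used only for this example.

```python
# Mixture from the list above, as integer percentages.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "StackOverFlow": 2,
}

# Assumption: "1M samples" is taken as exactly 1,000,000.
TOTAL_SAMPLES = 1_000_000


def samples_per_source(total, mixture_percent):
    """Return the approximate number of samples contributed by each source.

    Assumes the percentages are shares of the sample count.
    """
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding.
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under these assumptions, FineWeb-Edu would account for about 700,000 of the samples and StackOverFlow for about 20,000.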