---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---
# cosmo2-tokenizer
Tokenizer used to train cosmo2. It was trained on 1M samples drawn from the following data mixture:
- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- StackOverFlow 2%
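
As a rough illustration of the mixture above, the sketch below converts the listed percentages into approximate per-source sample counts. It assumes the percentages are shares of the 1M samples (rather than of tokens or bytes), which this README does not state. The names `MIX_PERCENT`, `TOTAL_SAMPLES`, and `samples_per_source` are hypothetical and used only for this example.

```python
# Mixture percentages copied from the list above.
MIX_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "StackOverFlow": 2,
}

# "1M samples" from the README, read here as exactly one million (assumption).
TOTAL_SAMPLES = 1_000_000


def samples_per_source(total, mix_percent):
    """Split `total` samples across sources according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIX_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under that assumption, the smallest source (StackOverFlow at 2%) would contribute about 20,000 samples and the largest (FineWeb-Edu at 70%) about 700,000.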