metadata
library_name: transformers
datasets:
  - HuggingFaceTB/cosmo2_training_data_subset_1M

cosmo2-tokenizer

Tokenizer used for training cosmo2. It was trained on 1M samples drawn from the following mixture:

  • FineWeb-Edu 70%
  • Cosmopedia v2 15%
  • StarCoderData 8%
  • OpenWebMath 5%
  • StackOverFlow 2%
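
Since the card lists `transformers` as its library, the tokenizer can presumably be loaded with `AutoTokenizer`. The snippet below is a minimal sketch, not an official usage guide. The repo id `HuggingFaceTB/cosmo2-tokenizer` is an assumption inferred from the dataset namespace, and `load_tokenizer` and `count_tokens` are hypothetical helper names introduced here for illustration.

```python
# Assumption: the tokenizer is hosted on the Hub under this repo id
# (inferred from the "HuggingFaceTB" dataset namespace); adjust if it differs.
REPO_ID = "HuggingFaceTB/cosmo2-tokenizer"


def load_tokenizer(repo_id: str = REPO_ID):
    """Download the tokenizer from the Hub, or load it from the local cache."""
    # Imported lazily so the module can be imported without transformers
    # installed; requires `pip install transformers` to actually call.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def count_tokens(tokenizer, text: str) -> int:
    """Return how many tokens `tokenizer` produces for `text`."""
    # add_special_tokens=False keeps BOS/EOS markers out of the count.
    return len(tokenizer.encode(text, add_special_tokens=False))
```

Calling `tok = load_tokenizer()` needs network access the first time. After that, `count_tokens(tok, "def add(a, b): return a + b")` gives a quick way to compare how compactly the tokenizer encodes code, math, or web text.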