---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---

# cosmo2-tokenizer
Tokenizer used to train cosmo2. It was trained on 1M samples drawn from the following mixture:
- FineWeb-Edu: 70%
- Cosmopedia v2: 15%
- StarCoderData: 8%
- OpenWebMath: 5%
- Stack Overflow: 2%
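
To make the mixture concrete, the sketch below converts the percentages listed above into approximate per-source sample counts. It assumes the percentages are shares of the 1M samples (rather than shares of tokens or bytes), which the card does not state explicitly; the names `MIXTURE_PERCENT`, `TOTAL_SAMPLES`, and `samples_per_source` are hypothetical and used only for illustration.

```python
# Mixture weights as listed in this card, in percent.
# Assumption: each percentage is a share of samples, not of tokens or bytes.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "Stack Overflow": 2,
}

TOTAL_SAMPLES = 1_000_000  # "1M samples" from the card


def samples_per_source(total, mixture_percent):
    """Return the approximate number of samples per source."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name:<15} {n:>9,}")
```

Under that assumption, roughly 700,000 of the samples would come from FineWeb-Edu and about 20,000 from Stack Overflow.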