---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---

# cosmo2-tokenizer

Tokenizer used for training cosmo2. It was trained on 1M samples drawn from the following mix:
- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- StackOverFlow 2%
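
Since the card declares `library_name: transformers`, the tokenizer can presumably be loaded with `AutoTokenizer`. The snippet below is a minimal sketch, not an official usage guide: the repo id `HuggingFaceTB/cosmo2-tokenizer` is an assumption (this card does not state it), and `load_tokenizer` and `token_count` are hypothetical helper names introduced for illustration.

```python
# Assumed Hub repo id; it is not stated in this card, so adjust it if needed.
REPO_ID = "HuggingFaceTB/cosmo2-tokenizer"


def load_tokenizer(repo_id: str = REPO_ID):
    """Load the tokenizer files from the Hub (or from the local cache)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def token_count(tokenizer, text: str) -> int:
    """Return how many token ids the tokenizer produces for `text`.

    The count includes any special tokens the tokenizer adds by default.
    """
    return len(tokenizer(text)["input_ids"])
```

With network access, `token_count(load_tokenizer(), "def add(a, b): return a + b")` gives a quick way to compare how compactly code, math, and prose are encoded.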