Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
- HuggingFaceFW/finepdfs
  Viewer • Updated • 476M • 19.9k • 851
- HuggingFaceFW/finepdfs-edu
  Viewer • Updated • 49.5M • 8.46k • 87
- HuggingFaceFW/ocr-annotations
  Viewer • Updated • 1.62k • 70 • 17
- HuggingFaceFW/finepdfs_lang_classification
  Viewer • Updated • 3.08M • 73.8k • 4
- FineWeb: decanting the web for the finest text data at scale
  🍷 1.33k • Explore and download the FineWeb web-text dataset
- HuggingFaceFW/fineweb
  Viewer • Updated • 52.5B • 649k • 2.77k
- HuggingFaceFW/fineweb-edu
  Viewer • Updated • 3.5B • 355k • 1.04k
- HuggingFaceFW/fineweb-edu-score-2
  Viewer • Updated • 13.9B • 18.4k • 85
1.8B models trained on 350BT to compare different pretraining datasets
- HuggingFaceFW/ablation-model-fineweb-edu
  Text Generation • 2B • Updated • 404 • 23
- HuggingFaceFW/ablation-model-fineweb-v1
  Text Generation • 2B • Updated • 267 • 14
- HuggingFaceFW/ablation-model-refinedweb
  Text Generation • 2B • Updated • 19 • 3
- HuggingFaceFW/ablation-model-c4
  Text Generation • 2B • Updated • 12 • 4
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
  Paper • 2506.20920 • Published • 78
- HuggingFaceFW/fineweb-2
  Viewer • Updated • 4.48B • 118k • 784
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks
  91 • Evaluate multilingual models using FineTasks
FineWeb-Edu datasets, classifier and ablation model
- HuggingFaceFW/fineweb-edu
  Viewer • Updated • 3.5B • 355k • 1.04k
- HuggingFaceFW/fineweb-edu-score-2
  Viewer • Updated • 13.9B • 18.4k • 85
- HuggingFaceFW/fineweb-edu-classifier
  Text Classification • 0.1B • Updated • 35.1k • 211
- HuggingFaceFW/ablation-model-fineweb-edu
  Text Generation • 2B • Updated • 404 • 23
Ablation models trained for our data experiments.
- HuggingFaceFW/ablation-exp-textext-warc_trafilatura-28BT
  Text Generation • 2B • Updated • 7 • 1
- HuggingFaceFW/ablation-exp-textext-wet-28BT
  Text Generation • 2B • Updated • 7
- HuggingFaceFW/ablation-exp-fw-base_filtering-350BT
  Text Generation • 2B • Updated • 6
- HuggingFaceFW/ablation-exp-dedup-global_minhash-350BT
  Text Generation • 2B • Updated • 4