Update README.md

To make sure this classifier makes sense, it is applied to various datasets.

Educational Value = 2 points * P(High) + 1 point * P(Mid) + 0 points * P(Low)
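The formula above is an expected score over the classifier's three quality classes. A minimal sketch of how it could be computed, assuming the classifier yields per-document probabilities for High/Mid/Low (the function names here are illustrative, not from the original repository):

```python
def educational_value(p_high: float, p_mid: float, p_low: float) -> float:
    """Expected educational score with weights High=2, Mid=1, Low=0."""
    # The three class probabilities should form a distribution.
    assert abs(p_high + p_mid + p_low - 1.0) < 1e-6, "probabilities must sum to 1"
    return 2.0 * p_high + 1.0 * p_mid + 0.0 * p_low

def average_educational_value(prob_rows):
    """Mean score over sampled documents, as reported per dataset in the table."""
    return sum(educational_value(*row) for row in prob_rows) / len(prob_rows)
```

The "Average Educational Value" column below is this mean taken over each dataset's sampled documents (the first 10,000 in every case).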
The classifier aligns with expectations: the textbook category scores the highest.
| Dataset | Sampling | Average Educational Value | Type |
|---|---|---|---|
| [SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) | First 10,000 | 1.846 | Synthetic |
| [HuggingFaceTB/cosmopedia stanford](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.673 | Synthetic |
| [nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) | First 10,000 | 1.668 | Synthetic |
| [vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) | First 10,000 | 1.664 | Synthetic |
| [HuggingFaceTB/cosmopedia web_samples_v1](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.615 | Synthetic |
| [nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) | First 10,000 | 1.581 | Synthetic |
| [HuggingFaceTB/cosmopedia web_samples_v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.559 | Synthetic |
| [HuggingFaceTB/cosmopedia openstax](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.468 | Synthetic |
| [HuggingFaceTB/cosmopedia wikihow](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.420 | Synthetic |
| [HuggingFaceTB/cosmopedia khanacademy](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.378 | Synthetic |
| [HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.350 | Synthetic |
| [NousResearch/dolma-v1_7-305B\*](https://huggingface.co/datasets/NousResearch/dolma-v1_7-305B) | First 10,000 | 1.290 | Real |
| [armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) | First 10,000 | 1.256 | Real |
| [wikipedia en 20220301](https://huggingface.co/datasets/wikipedia) | First 10,000 | 1.237 | Real |
| [HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | First 10,000 | 1.156 | Synthetic |
| [armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) | First 10,000 | 1.069 | Real |
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | First 10,000 | 1.058 | Real |
| [BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) | First 10,000 | 1.017 | Real |
| [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) | First 10,000 | 0.994 | Real |
| [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | First 10,000 | 0.853 | Real |
\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) that prevented me from processing the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/fCWpxWB1yLmJwWhPXsjIw.png)
123 |
|
|
|
125 |
Wikipedia scores comparatively lower because it is not textbook after all and it also contains information (result of a match) that has small educational value.
|
126 |
Web scores the lowest.
|
127 |
|
128 |
+
In general, the synthetic data has higher education value because they are created with a high educational value by design.
|
129 |
+
For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied extensive quality filter described in [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
|
130 |
+
|