# llm-data-textbook-quality-fasttext-classifer-v2

This classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
The model is built on fasttext: it can classify more than 2000 examples per second on CPU, and so it can be used **on-the-fly**.
This model classifies whether a text has high educational value (more explicitly defined than textbook quality). This change in definition is a substantial difference from [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
It can be used as a filter for data curation when training an LLM.
There are 3 labels instead of 2, as this offers higher granularity of educational value.
- Low (Bottom 25% educational value)

Please note that textbook quality is a subset of high quality.
A detailed report/paper will follow when more downstream experiments with this classifier become available.
The classifier has been applied to various pretraining datasets. See [**Benchmark**](#Benchmark).
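As a rough sketch of how such a classifier could slot into an on-the-fly curation pipeline: stream texts through a scoring function and keep only those above a threshold. The `toy_predict` scorer below is a hypothetical stand-in for illustration only (not this repository's API); in practice the score would come from the fastText model's label probabilities.

```python
from typing import Callable, Iterable, Iterator


def filter_by_educational_value(
    texts: Iterable[str],
    predict: Callable[[str], float],  # returns a 0-2 educational value score
    threshold: float = 1.0,
) -> Iterator[str]:
    """Keep only texts whose predicted educational value meets the threshold.

    Because fastText inference is fast (>2000 examples/s on CPU, per above),
    a streaming filter like this can run on the fly during data curation.
    """
    for text in texts:
        if predict(text) >= threshold:
            yield text


# Hypothetical stand-in scorer, for illustration only.
def toy_predict(text: str) -> float:
    return 2.0 if "theorem" in text else 0.5


docs = ["A theorem and its proof.", "click here to win"]
kept = list(filter_by_educational_value(docs, toy_predict))
```

The generator form keeps memory flat, which matters when filtering pretraining-scale corpora.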
## Feedback welcomed!

Educational Value = 2 points × P(High) + 1 point × P(Mid) + 0 points × P(Low)

The score can be roughly interpreted as:

|Educational Value| Category |
|--------|----------|
|2 | High|
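The scoring formula above can be computed directly from the three label probabilities. A minimal sketch follows; note the `__label__High`/`__label__Mid`/`__label__Low` names are an assumption about the fastText label format, not something confirmed by this card.

```python
def educational_value(labels, probs):
    """Educational Value = 2*P(High) + 1*P(Mid) + 0*P(Low).

    `labels` and `probs` mirror the (labels, probabilities) pair that
    fastText's predict(text, k=3) returns; the label strings here are
    assumed, not confirmed.
    """
    weight = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}
    return sum(weight.get(label, 0.0) * p for label, p in zip(labels, probs))


# Example with made-up probabilities:
labels = ("__label__High", "__label__Mid", "__label__Low")
probs = (0.7, 0.2, 0.1)
score = educational_value(labels, probs)  # 2*0.7 + 1*0.2 + 0*0.1 ≈ 1.6
```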

|Dataset | Sampling | Average Educational Value | Type |
|--------------------------------------|---|-------------------|-------|
|[SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) |First 100,000 | 1.846 |Synthetic|
|[nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) |First 100,000 | 1.673 |Synthetic|
|[HuggingFaceTB/cosmopedia stanford](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.673 |Synthetic|
|[vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) |First 100,000 | 1.663 |Synthetic|
|[HuggingFaceTB/cosmopedia web_samples_v1](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.618 |Synthetic|
|[nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) |First 100,000 | 1.586 |Synthetic|
|[HuggingFaceTB/cosmopedia web_samples_v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.562 |Synthetic|
|[HuggingFaceTB/cosmopedia openstax](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.462 |Synthetic|
|[HuggingFaceTB/cosmopedia wikihow](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.422 |Synthetic|
|[HuggingFaceTB/cosmopedia khanacademy](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.419 |Synthetic|
|[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
|[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
|[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
|[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
|[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
|[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 100,000 | 1.056 |Real|
|[NousResearch/dolma-v1_7-305B\*](https://huggingface.co/datasets/NousResearch/dolma-v1_7-305B) |First 100,000 | 1.037 |Real|
|[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 100,000 | 1.019 |Real|
|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 100,000 | 0.998 |Real|
|[togethercomputer/RedPajama-Data-V2 en 2023-06](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)| First 100,000 | 0.985 |Real|
|[wikipedia en 20220301](https://huggingface.co/datasets/wikipedia) |First 100,000 | 0.975 |Real|
|[allenai/c4 en](https://huggingface.co/datasets/allenai/c4)| First 100,000 | 0.934 |Real|
|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 100,000 | 0.857 |Real|
|[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)| First 100,000 | 0.835 |Real|

\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).

The classifier aligns with expectations.
- In general, synthetic data has higher educational value because it is created to have high educational value by design.
- For real data, [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filter described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), have the highest educational value across all real data.
- In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
- The textbook category (mostly synthetic) scores the highest, because those datasets are created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest, because of its density of knowledge.
- Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the award of a movie star) with lower educational value.
- Web data scores the lowest (if no filtering is applied) because it spans all domains.