kenhktsui committed
Commit 00800ee
1 Parent(s): e1bfefc

Update README.md

Files changed (1):
  1. README.md +34 -33
README.md CHANGED
@@ -9,7 +9,7 @@ pipeline_tag: text-classification
 ---
 # llm-data-textbook-quality-fasttext-classifer-v2
 This classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
- The model is built on fastText; it can classify more than 2,000 examples per second on CPU.
+ The model is built on fastText; it can classify more than 2,000 examples per second on CPU, so it can be used **on-the-fly**.
 This model can classify whether a text has high educational value (more explicitly defined than textbook quality). This definition change is a substantial change vs [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
 It can be used as a filter for data curation when training an LLM.
 There are 3 labels instead of 2, as this offers higher granularity of educational value.
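For illustration, a minimal sketch of on-the-fly classification with this model. It assumes the repo ships its fastText weights as `model.bin` and that the three classes are emitted as `__label__High`, `__label__Mid`, and `__label__Low`; check the repo files and model card if either assumption does not hold.

```python
# Minimal sketch: load the classifier from the Hub and label texts on the fly.
# Assumptions: the repo ships a fastText binary named "model.bin" and the
# three classes are "__label__High" / "__label__Mid" / "__label__Low".
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    "kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"
)
model = fasttext.load_model(model_path)

texts = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "lol ok c u at 8",
]
# fastText treats each input as a single line, so strip embedded newlines first.
labels, probs = model.predict([t.replace("\n", " ") for t in texts], k=1)
for text, label, prob in zip(texts, labels, probs):
    print(f"{label[0]} ({prob[0]:.2f}): {text[:60]}")
```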
@@ -18,7 +18,8 @@ There are 3 labels instead of 2 labels, as it offers higher granularity of educa
 - Low (Bottom 25% educational value)
 
 Please note textbook quality is a subset of high quality.
- A detailed report/paper will follow when more downstream experiments of this classifier become available.
+ A detailed report/paper will follow when more downstream experiments of this classifier become available.
+ The classifier has been applied to various pretraining datasets. See [**Benchmark**](#Benchmark).
 
 
 ## Feedback welcomed!
@@ -93,7 +94,7 @@ To make sure this classifier makes sense, it is applied to various datasets.
 
 Educational Value = 2 × P(High) + 1 × P(Mid) + 0 × P(Low)
 
- The score can be interpreted as:
+ The score can be roughly interpreted as:
 |Educational Value| Category |
 |--------|----------|
 |2 | High|
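A minimal sketch of this weighting, reusing the `model` loaded in the snippet above (the `__label__High`/`__label__Mid`/`__label__Low` names and the `educational_value` helper are illustrative assumptions, not part of the repo):

```python
# Sketch of the weighted score above: 2*P(High) + 1*P(Mid) + 0*P(Low).
# WEIGHTS keys assume the "__label__High/Mid/Low" naming used earlier.
from typing import List

WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def educational_value(model, texts: List[str]) -> List[float]:
    # k=-1 returns the probability of every label, so the score uses the
    # full distribution rather than only the top prediction.
    labels_batch, probs_batch = model.predict(
        [t.replace("\n", " ") for t in texts], k=-1
    )
    return [
        sum(WEIGHTS.get(label, 0.0) * prob for label, prob in zip(labels, probs))
        for labels, probs in zip(labels_batch, probs_batch)
    ]

# Example filter: keep only documents scoring at least 1.0 ("Mid" or better).
# kept = [d for d, s in zip(docs, educational_value(model, docs)) if s >= 1.0]
```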
@@ -103,39 +104,39 @@ The score can be interpreted as:
 
 |Dataset | Sampling | Average Educational Value | Type |
 |--------------------------------------|---|-------------------|-------|
- |[SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) |First 10,000 | 1.846 |Synthetic|
- |[nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) |First 10,000 | 1.668 |Synthetic|
- |[HuggingFaceTB/cosmopedia stanford](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.673 |Synthetic|
- |[vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) |First 10,000 | 1.664 |Synthetic|
- |[HuggingFaceTB/cosmopedia web_samples_v1](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.615 |Synthetic|
- |[nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) |First 10,000 | 1.581 |Synthetic|
- |[HuggingFaceTB/cosmopedia web_samples_v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.559 |Synthetic|
- |[HuggingFaceTB/cosmopedia openstax](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.468 |Synthetic|
- |[HuggingFaceTB/cosmopedia wikihow](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.420 |Synthetic|
- |[HuggingFaceTB/cosmopedia khanacademy](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.378 |Synthetic|
- |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.350 |Synthetic|
- |[NousResearch/dolma-v1_7-305B*](https://huggingface.co/datasets/NousResearch/dolma-v1_7-305B) |First 10,000 | 1.290 |Real|
- |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 10,000 | 1.256 |Real|
- |[wikipedia en 20220301](https://huggingface.co/datasets/wikipedia) |First 10,000 | 1.237 |Real|
- |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 10,000 | 1.156 |Synthetic|
- |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 10,000 | 1.069 |Real|
- |[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |First 10,000 | 1.058 |Real|
- |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) |First 10,000 | 1.017 |Real|
- |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) |First 10,000 | 0.994 |Real|
- |[togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) |First 10,000 | 0.979 |Real|
- |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) |First 10,000 | 0.853 |Real|
- |[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) |First 10,000 | 0.798 |Real|
+ |[SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) |First 100,000 | 1.846 |Synthetic|
+ |[nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) |First 100,000 | 1.673 |Synthetic|
+ |[HuggingFaceTB/cosmopedia stanford](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.673 |Synthetic|
+ |[vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) |First 100,000 | 1.663 |Synthetic|
+ |[HuggingFaceTB/cosmopedia web_samples_v1](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.618 |Synthetic|
+ |[nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) |First 100,000 | 1.586 |Synthetic|
+ |[HuggingFaceTB/cosmopedia web_samples_v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.562 |Synthetic|
+ |[HuggingFaceTB/cosmopedia openstax](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.462 |Synthetic|
+ |[HuggingFaceTB/cosmopedia wikihow](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.422 |Synthetic|
+ |[HuggingFaceTB/cosmopedia khanacademy](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.419 |Synthetic|
+ |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
+ |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
+ |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
+ |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
+ |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
+ |[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |First 100,000 | 1.056 |Real|
+ |[NousResearch/dolma-v1_7-305B*](https://huggingface.co/datasets/NousResearch/dolma-v1_7-305B) |First 100,000 | 1.037 |Real|
+ |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) |First 100,000 | 1.019 |Real|
+ |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) |First 100,000 | 0.998 |Real|
+ |[togethercomputer/RedPajama-Data-V2 en 2023-06](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) |First 100,000 | 0.985 |Real|
+ |[wikipedia en 20220301](https://huggingface.co/datasets/wikipedia) |First 100,000 | 0.975 |Real|
+ |[allenai/c4 en](https://huggingface.co/datasets/allenai/c4) |First 100,000 | 0.934 |Real|
+ |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) |First 100,000 | 0.857 |Real|
+ |[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) |First 100,000 | 0.835 |Real|
 \* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
 
 
 The classifier aligns with expectations.
 
 - In general, the synthetic data has higher educational value because it is created with high educational value by design.
- - For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filter described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
- - The textbook category scores the highest, reflecting the effectiveness of this model.
- - Wikipedia scores comparatively lower because it is not a textbook after all, and it also contains information (e.g. the result of a match) that has small educational value.
- - Web scores the lowest.
+ - For real data, [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filter described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), have the highest educational value across all real data.
+ - In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
+ - The textbook category (mostly synthetic) scores the highest because these datasets are created for educational value, reflecting the effectiveness of this model.
+ - The maths/paper category scores the second highest because of its density of knowledge.
+ - Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the awards of a movie star) that has smaller educational value.
+ - Web scores the lowest (if no filtering is applied) because it contains all domains.
 