Text Classification
fastText
English
kenhktsui committed on
Commit
ae5e8a4
1 Parent(s): d48877b

Update README.md

Files changed (1)
  1. README.md +10 -6
README.md CHANGED
@@ -13,10 +13,9 @@ inference: false
 
 ## **"Garbage in, garbage out. A language model is only as good as its training data, irrespective of its parameter count."**
 
- This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
- The model is built on fastText; it can classify more than 2000 examples per second on CPU, so it can be used **on-the-fly** during pretraining.
- The model can classify whether a text has high educational value (more explicitly defined than textbook quality). This definition change is substantial compared with [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
- It can be used as a filter for data curation when training an LLM.
+ 📚 The educational value classifier can classify whether a text from the web has high educational value (more explicitly defined than textbook quality). It is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
+ The model is trained on web/raw text, not on data formatted as an instruction dataset (yet).
+ It can be used as a filter for pretraining data curation when training an LLM 🤖.
 There are 3 labels instead of 2 labels, as it offers higher granularity of educational value.
 - High (Top 25% educational value)
 - Mid (Middle 25-75% educational value)
@@ -25,6 +24,8 @@ There are 3 labels instead of 2 labels, as it offers higher granularity of educa
 A detailed report/paper will follow when more downstream experiments with this classifier become available.
 The classifier has been applied to various pretraining datasets. See [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).
 
+ ⚡ The model is built on fastText; it can classify more than 2000 examples per second on CPU, so it can be used **on-the-fly** during pretraining.
+
 Please note textbook quality is a subset of high quality.
 
 ## Feedback welcomed!
@@ -148,7 +149,10 @@ The classifier aligns with the expectation.
 - In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
 - The textbook category (mostly synthetic) scores the highest, because these texts are created for educational value, reflecting the effectiveness of this model.
 - The maths/paper category scores the second highest, because of its density of knowledge.
- - Instruction datasets score less than textbooks because knowledge in conversation is usually less dense than in textbooks, but they are in general more educational than the unfiltered web.
 - Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, an award won by a movie star) that has less educational value.
 - Web scores low (if no filtering is applied) because it contains all domains.
- - Meme scores the lowest, as expected. Hateful memes score almost zero.
+ - Meme scores the lowest, as expected. Hateful memes score almost zero.
+
+ Some instruction datasets are added for curiosity's sake, although the model is not trained on instruction data. There are two possible interpretations:
+ - They score less than textbooks because knowledge in conversation is usually less dense than in textbooks, but they are in general more educational than the unfiltered web.
+ - The model does not perform well enough to assess the educational value of instruction datasets.
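For readers who want to try the model, below is a minimal sketch of how a fastText classifier like this one can be loaded and used as an on-the-fly pretraining filter. The model filename, the `__label__High`/`__label__Mid`/`__label__Low` label names, and the label-to-score weighting are illustrative assumptions, not details taken from this commit.

```python
# Minimal sketch: score texts with a fastText educational-value classifier
# and lazily keep only those above a threshold. Assumptions (not from this
# commit): the model file is "model.bin" and the labels are
# __label__High / __label__Mid / __label__Low.
import fasttext

model = fasttext.load_model("model.bin")  # hypothetical local filename

LABEL_WEIGHTS = {  # assumed mapping of labels to a scalar score in [0, 1]
    "__label__High": 1.0,
    "__label__Mid": 0.5,
    "__label__Low": 0.0,
}

def educational_value(text: str) -> float:
    # fastText predicts one line at a time, so newlines must be removed.
    labels, probs = model.predict(text.replace("\n", " "), k=3)
    # Probability-weighted average of the per-label scores.
    return sum(LABEL_WEIGHTS.get(label, 0.0) * p for label, p in zip(labels, probs))

def filter_pretraining_docs(docs, threshold=0.5):
    # Yield only documents that clear the educational-value bar; this is
    # cheap enough to run on-the-fly inside a data-loading pipeline.
    for doc in docs:
        if educational_value(doc) >= threshold:
            yield doc

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "click here 2 win a free iphone!!!",
]
print([round(educational_value(d), 3) for d in docs])
```

The probability-weighted score is just one plausible way to collapse the three labels into a single number; a stricter filter could instead keep only documents whose top predicted label is High.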