Update README.md
README.md CHANGED
@@ -13,10 +13,9 @@ inference: false
 
## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
 
-
-
-
-It can be used as a filter for data curation when training a LLM.
+📚 The educational value classifier can classify whether a text from the web has high educational value (more explicitly defined than textbook quality). It is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data, and was then used for data filtering.
+The model is trained on web/raw text, not on data formatted as an instruction dataset (yet).
+It can be used as a filter for pretraining data curation when training an LLM 🤖.
There are 3 labels instead of 2, as this offers higher granularity of educational value:
- High (Top 25% educational value)
- Mid (Middle 25-75% educational value)
@@ -25,6 +24,8 @@ There are 3 labels instead of 2 labels, as it offers higher granularity of educa
A detailed report/paper will follow when more downstream experiments of this classifier become available.
The classifier has been applied to various pretraining datasets. See [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).
 
+⚡ The model is built on fastText - it can classify more than 2,000 examples per second on CPU, so it can be used **on-the-fly** during pretraining.
+
Please note textbook quality is a subset of high quality.
 
## Feedback welcomed!
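To make the on-the-fly filtering use case above concrete, here is a minimal sketch of how the classifier might be wired into a data-curation loop. It assumes the fastText weights are published as `model.bin` in the repo linked above and that the labels are `__label__High`/`__label__Mid`/`__label__Low`; the single-score weighting and the 0.5 threshold are illustrative choices, not taken from this README.

```python
# Sketch: filtering pretraining text with the educational-value classifier.
# Assumptions: the repo ships its fastText weights as "model.bin" and uses the labels
# __label__High / __label__Mid / __label__Low -- check the model card for the real names.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2",
    filename="model.bin",  # assumed filename
)
model = fasttext.load_model(model_path)

def educational_value(text: str) -> float:
    # fastText's predict() rejects newlines, so collapse them first.
    labels, probs = model.predict(text.replace("\n", " "), k=3)
    scores = dict(zip(labels, probs))
    # Collapse the 3 labels into one score in [0, 1]; High=1, Mid=0.5, Low=0 is an
    # illustrative weighting, not the official one.
    return float(scores.get("__label__High", 0.0) + 0.5 * scores.get("__label__Mid", 0.0))

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "lol did u see the game last nite",
]
kept = [d for d in docs if educational_value(d) >= 0.5]  # threshold is a free parameter
```

Because fastText inference is fast on CPU, the same `educational_value` call could sit directly inside a streaming data pipeline rather than run as a separate preprocessing pass.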
@@ -148,7 +149,10 @@ The classifier aligns with the expectation.
- In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
- The textbook category (mostly synthetic) scores the highest, because such data is created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest, because of its density of knowledge.
-- Instruction dataset scores less than textbook because depth of knowledge in conversation is usually less dense in textbook, but they are in general more educative than unfiltered web.
- Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the award won by a movie star) that has lower educational value.
- Web scores low (if no filtering is applied) because it contains all domains.
-- Meme scores the lowest as expected. Hateful memes almost got zero point.
+- Memes score the lowest, as expected. Hateful memes score almost zero.
+
+Some instruction datasets are added for curiosity's sake, although the model is not trained on instruction data. There are two possible interpretations:
+- They score less than textbooks because the depth of knowledge in conversation is usually less dense than in a textbook, but they are in general more educational than unfiltered web text.
+- The model does not perform well enough to tell the educational value of instruction datasets.