Update README.md

Some instruction datasets are added for curiosity's sake, although the model is not trained on instruction data. There are two possible interpretations:
- They score lower than textbooks because knowledge in conversational data is usually less dense than in a textbook, but they are in general more educational than unfiltered web data.
- The model does not perform well enough to judge the educational value of instruction datasets.

# 📈Analysis
## 🤖Model training with and without classifier
The expectation is that the model trained with the filter will outperform the model trained without it.
FineWeb is filtered on the fly with Educational Value >= 1.

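A minimal sketch of this on-the-fly filtering is shown below: FineWeb is streamed and only records scoring at or above the threshold are kept. The helper `educational_value()` is a stand-in for the classifier described in this README, not its actual API.

```python
# Minimal sketch of on-the-fly filtering with the `datasets` streaming API.
from datasets import load_dataset


def educational_value(text: str) -> float:
    # Placeholder so the sketch runs end to end; replace with the classifier's
    # actual scoring call.
    return 1.0


# Stream FineWeb so nothing has to be downloaded up front.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Keep only records with Educational Value >= 1.
filtered = fineweb.filter(lambda row: educational_value(row["text"]) >= 1.0)

for row in filtered.take(3):  # peek at a few records that pass the filter
    print(row["url"], row["text"][:80])
```

Because the stream is filtered lazily, no filtered copy of FineWeb has to be written to disk before training.
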
Test 1:
- Model params: 192M
- Training tokens: 3.1B (6,000 global steps)

|Task | Training on FineWeb With Filtering | Training on FineWeb Without Filtering | Training with [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)|
|---|---|---|---|
|arc-easy | 37.37 | 34.97 | 37.45 |
|arc-challenge | 23.55 | 22.95 | 23.21 |
|Hellaswag | 28.02 | 27.92 | 27.78 |
|MMLU | 24.71 | 23.94 | 24.65 |
|TruthfulQA | 45.88 | 45.20 | 45.97 |
|Winogrande | 49.49 | 50.59 | 50.67 |

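For reference, scores on these tasks can be computed with EleutherAI's lm-evaluation-harness; the sketch below uses a placeholder checkpoint path and is illustrative rather than the exact setup behind the table.

```python
# Hypothetical evaluation sketch using EleutherAI's lm-evaluation-harness (v0.4+).
# The checkpoint path is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/192M-checkpoint",  # placeholder path
    tasks=["arc_easy", "arc_challenge", "hellaswag",
           "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, comparable to the table above
```
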
Reasoning and commonsense reasoning scores appear to be better with the classifier filter, aligning with the expectation. They are also close to those obtained with Cosmopedia.
MMLU is also better; however, it is close to random due to compute limitations (both training time and model size).
A larger model will be trained to further validate this claim.

(To be updated with a larger model soon)

## 🌐Domain Name Analysis
The expectation is that most of the educational value comes from the websites of universities, schools, research institutes, and organisations.
Since [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) contains the URL of each crawled website, the average educational value of each domain name can be calculated.
The first 10M records have been analysed. The full file is available [here](https://drive.google.com/file/d/1WnOEH7IwfLJba2CuY207JY6s5hcW1gZQ/view?usp=sharing).

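A rough sketch of the per-domain aggregation is shown below: the domain is taken from FineWeb's `url` field and the classifier scores are averaged per domain. `educational_value()` is again a stand-in for the classifier, and the exact pipeline behind the linked file may differ.

```python
# Rough sketch of per-domain averaging of educational value.
from collections import defaultdict
from urllib.parse import urlparse

from datasets import load_dataset


def educational_value(text: str) -> float:
    return 1.0  # placeholder for the classifier's actual scoring call


fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

score_sum: dict[str, float] = defaultdict(float)
n_records: dict[str, int] = defaultdict(int)

# The analysis above covers the first 10M records; a smaller number is used
# here to keep the sketch quick to run.
for row in fineweb.take(100_000):
    domain = urlparse(row["url"]).netloc  # e.g. "en.wikipedia.org"
    score_sum[domain] += educational_value(row["text"])
    n_records[domain] += 1

# Average educational value per domain, keeping domains with >= 100 records,
# then take the 100 highest-scoring domains (as in the chart below).
averages = {d: score_sum[d] / n_records[d] for d in score_sum if n_records[d] >= 100}
top_100 = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:100]
print(top_100[:5])
```
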
Below are the top 100 domain names with at least 100 records each.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)