Text Classification
fastText
English
kenhktsui commited on
Commit
9fb5293
1 Parent(s): f0e676b

Update README.md

  1. README.md +33 -0
README.md CHANGED
@@ -158,3 +158,36 @@ The classifier aligns with the expectation.
158
  Some instruction datasets are added out of curiosity, although the model is not trained on instruction data. There are two possible interpretations:
159
  - They score lower than textbooks because knowledge in conversations is usually less dense than in textbooks, but they are in general more educative than unfiltered web data.
160
  - The model does not perform well enough to judge educational value in instruction datasets.
161
+
162
+ # 📈Analysis
163
+ ## 🤖Model training with and without classifier
164
+ The expectation is that a model trained with the filter will outperform a model trained without it.
165
+ FineWeb is filtered on the fly, keeping only records with Educational Value >= 1.
166
+
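The on-the-fly filtering described above can be sketched as a lazy generator. This is a minimal illustration, not the training pipeline actually used: `filter_by_educational_value`, `toy_score`, and the `text` field name are assumptions made for the example, and `toy_score` is only a stand-in for the real classifier.

```python
from typing import Callable, Iterable, Iterator


def filter_by_educational_value(
    records: Iterable[dict],
    score_fn: Callable[[str], float],
    threshold: float = 1.0,
    text_key: str = "text",
) -> Iterator[dict]:
    """Lazily yield only the records whose score is >= threshold."""
    for record in records:
        # Score each record as it streams past; nothing is materialized.
        if score_fn(record[text_key]) >= threshold:
            yield record


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer; replace with the real classifier."""
    return 2.0 if "theorem" in text.lower() else 0.0


sample = [
    {"text": "Buy cheap watches now!"},
    {"text": "A theorem is a statement that has been proven."},
]
kept = list(filter_by_educational_value(sample, toy_score))
```

Because the function is a generator, it can wrap a streamed dataset without loading it into memory; only records that pass the threshold reach the training loop.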
167
+ Test 1:
168
+ Model params: 192M
169
+ Training tokens: 3.1B, 6000 global steps
170
+
171
+ | Task | Training on FineWeb With Filtering | Training on FineWeb Without Filtering | Training with [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |
+ |---|---|---|---|
+ | arc-easy | 37.37 | 34.97 | 37.45 |
+ | arc-challenge | 23.55 | 22.95 | 23.21 |
+ | Hellaswag | 28.02 | 27.92 | 27.78 |
+ | MMLU | 24.71 | 23.94 | 24.65 |
+ | TruthfulQA | 45.88 | 45.20 | 45.97 |
+ | Winogrande | 49.49 | 50.59 | 50.67 |
179
+
180
+ Reasoning and commonsense reasoning scores seem to be better when the classifier is used for filtering (Winogrande is the exception), aligning with expectation. The results are also close to Cosmopedia.
181
+ MMLU is also better; however, it is close to random due to limited compute (both training time and model size).
182
+ A larger model will be trained to further validate this claim.
183
+
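To make the Test 1 comparison easier to read, the per-task difference between training with and without filtering can be computed directly from the reported scores. The snippet below is only a sketch of that arithmetic; the variable names are illustrative and the numbers are copied from the Test 1 table.

```python
# (with filtering, without filtering) scores copied from the Test 1 table.
scores = {
    "arc-easy": (37.37, 34.97),
    "arc-challenge": (23.55, 22.95),
    "Hellaswag": (28.02, 27.92),
    "MMLU": (24.71, 23.94),
    "TruthfulQA": (45.88, 45.20),
    "Winogrande": (49.49, 50.59),
}

# Positive delta means the filtered run scored higher.
deltas = {
    task: round(with_filter - without_filter, 2)
    for task, (with_filter, without_filter) in scores.items()
}
improved = [task for task, delta in deltas.items() if delta > 0]
mean_delta = round(sum(deltas.values()) / len(deltas), 3)

for task, delta in deltas.items():
    print(f"{task:<14} {delta:+.2f}")
print(f"mean delta     {mean_delta:+.3f}")
```

Filtering helps on five of the six tasks, and Winogrande is the only one where the unfiltered run scores higher. With a 192M-parameter model these gaps are small, so they are best read as indicative rather than conclusive.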
184
+ (To be updated with a larger model soon)
185
+
186
+ ## 🌐Domain Name Analysis
187
+ The expectation is that most educational value comes from the websites of universities/schools, research institutes and organisations.
188
+ Since [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) contains the URL of each crawled page, the average educational value of each domain name is calculated.
189
+ The first 10M records have been analysed. The full file is available [here](https://drive.google.com/file/d/1WnOEH7IwfLJba2CuY207JY6s5hcW1gZQ/view?usp=sharing).
190
+
191
+ Below are the top 100 domain names with at least 100 records.
192
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)
193
+
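The domain-level aggregation described in this section can be sketched as follows. This is one possible way to do it, not necessarily how the published file was produced: `average_score_by_domain` is a hypothetical helper, the input is assumed to be `(url, score)` pairs, and the domain is taken as the URL's network location without any normalization of subdomains such as `www.`.

```python
from collections import defaultdict
from urllib.parse import urlparse


def average_score_by_domain(records, min_records=100, top_n=100):
    """Return (domain, mean score, count) tuples, highest mean first.

    `records` is an iterable of (url, score) pairs. Domains with fewer
    than `min_records` records are dropped before ranking.
    """
    totals = defaultdict(lambda: [0.0, 0])  # domain -> [score sum, count]
    for url, score in records:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip URLs without a parsable host
        totals[domain][0] += score
        totals[domain][1] += 1

    ranking = [
        (domain, total / count, count)
        for domain, (total, count) in totals.items()
        if count >= min_records
    ]
    ranking.sort(key=lambda row: row[1], reverse=True)
    return ranking[:top_n]


# Toy data with a lowered record threshold, for illustration only.
toy = [
    ("https://example.edu/a", 2.0),
    ("https://example.edu/b", 1.0),
    ("https://shop.example.com/x", 0.0),
    ("https://shop.example.com/y", 0.5),
    ("https://rare.example.org/z", 2.0),
]
ranking = average_score_by_domain(toy, min_records=2, top_n=10)
```

The minimum-record filter matters because a domain with a single high-scoring page would otherwise outrank domains with many consistently good pages.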