Text Classification
fastText
English
kenhktsui committed on
Commit
5e491b3
1 Parent(s): 2962a95

Update README.md

Files changed (1)
  1. README.md +6 -3
README.md CHANGED
@@ -161,9 +161,9 @@ Some instruction datasets are added for curiosity sake although model is not tra
161
  - The model does not perform well enough to tell educational value in instruction datasets.
162
 
163
  # 📈Analysis
164
- ## 🤖Model training with and without classifier
165
  The expectation is that the model trained with filter will outperform model trained without the filter.
166
- Fineweb is filtered on the fly with Educational Value >= 1.
167
 
168
  Test 1:
169
  Model params: 192M
@@ -178,7 +178,7 @@ Training token: 3.1B training token, 6000 global steps
178
  |TruthfulQA| 45.88 | 45.20| 45.97|
179
  |Winogrande| 49.49 | 50.59 | 50.67 |
180
 
181
- The reasoning and commonsense reasoning seems to be better when class, aligning with expectation. It is also close to Cosmopedia.
182
  MMLU is better also; however it is close to random due to limitation in compute (both training time and model size).
183
  Model of larger size will be trained to further validate this claim.
184
 
@@ -192,3 +192,6 @@ The first 10M records have been analysed. Full file in [here](https://drive.goo
192
  Below is the top 100 domain names, with no of record >= 100.
193
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)
194
 
 
 
 
 
161
  - The model does not perform well enough to tell educational value in instruction datasets.
162
 
163
  # 📈Analysis
164
+ ## 🤖Model Training With And Without Classifier
165
  The expectation is that the model trained with filter will outperform model trained without the filter.
166
+ Fineweb is filtered on the fly with Educational Value >= 1.0.
167
 
168
  Test 1:
169
  Model params: 192M
 
178
  |TruthfulQA| 45.88 | 45.20| 45.97|
179
  |Winogrande| 49.49 | 50.59 | 50.67 |
180
 
181
+ The reasoning and commonsense reasoning seems to be better when filter is on, aligning with expectation. It is also close to Cosmopedia.
182
  MMLU is better also; however it is close to random due to limitation in compute (both training time and model size).
183
  Model of larger size will be trained to further validate this claim.
184
 
 
192
  Below is the top 100 domain names, with no of record >= 100.
193
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)
194
 
195
+ ## 🧪Classifier Ranking Ordering
196
+ Spearman rank-order correlation coefficient between Educational Value and that of test data is 0.7055, indicating a strong monotonic relationship. The Educational Value can be used for ranking.
197
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/dKV2oXRv3WpEsfDXy0bl7.png)
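
The README does not say how the Spearman coefficient was computed, so the following is only a minimal sketch of the statistic itself, assuming the standard definition (Pearson correlation of the rank vectors, with ties given average ranks). The function names and the toy scores are hypothetical and are not taken from the repository or its test data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical values: classifier scores vs. reference labels.
    predicted_educational_value = [0.2, 1.4, 0.9, 1.8, 0.1]
    reference_label = [0, 2, 1, 2, 0]
    print(spearman(predicted_educational_value, reference_label))
```

Because the coefficient depends only on ranks, it measures whether a higher predicted Educational Value tends to go with a higher reference score, without assuming the relationship is linear. That is the property that matters when the score is used for ranking rather than as a calibrated value.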