kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2

Text Classification


kenhktsui committed on May 27

Commit 5e491b3 • 1 Parent(s): 2962a95

Update README.md

Files changed (1)

README.md +6 -3

@@ -161,9 +161,9 @@ Some instruction datasets are added for curiosity sake although model is not tra
 - The model does not perform well enough to tell educational value in instruction datasets.
 # 📈Analysis
-## 🤖Model training with and without classifier
+## 🤖Model Training With And Without Classifier
 The expectation is that the model trained with filter will outperform model trained without the filter.
-Fineweb is filtered on the fly with Educational Value >= 1.
+Fineweb is filtered on the fly with Educational Value >= 1.0.
 Test 1:
 Model params: 192M
@@ -178,7 +178,7 @@ Training token: 3.1B training token, 6000 global steps
 |TruthfulQA| 45.88 | 45.20| 45.97|
 |Winogrande| 49.49 | 50.59 | 50.67 |
-The reasoning and commensense reasoning seems to be better when class, aligning with expectation. It is also close to Cosmopedia.
+The reasoning and commensense reasoning seems to be better when filter is on, aligning with expectation. It is also close to Cosmopedia.
 MMLU is better also; however it is close to random due to limitation in compute (both training time and model size).
 Model of larger size will be trained to further validate this claim.
@@ -192,3 +192,6 @@ The first 10M records have been analysed.  Full file in [here](https://drive.goo
 Below is the top 100 domain names, with no of record >= 100.
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)
+## 🧪Classifier Ranking Ordering
+Spearman rank-order correlation coefficient between Educational Value and that of test data is 0.7055, indicating a strong monotonic relationship. The Educational Value can be used for ranking.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/dKV2oXRv3WpEsfDXy0bl7.png)
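
The added "Classifier Ranking Ordering" section reports a Spearman rank-order correlation. As a rough illustration of how such a coefficient could be computed, here is a minimal pure-Python sketch. It is not the author's evaluation code: the helper names (`average_ranks`, `pearson`, `spearman`) and the sample values are hypothetical, and the commit does not say how the 0.7055 figure was produced.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: predicted Educational Value vs. labels encoded as numbers.
predicted = [0.1, 0.4, 0.35, 1.2, 1.8]
labels = [0, 0, 1, 1, 2]
rho = spearman(predicted, labels)
```

Because the coefficient depends only on ranks, it measures whether higher predicted scores tend to go with higher labels. That is the property the README relies on when it says the score can be used for ranking.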