Update README.md

Some instruction datasets are added for curiosity's sake, although the model is not trained on instruction data. There are two possible interpretations:
- They score lower than textbooks because knowledge in conversational data is usually less dense than in a textbook, but they are in general more educational than unfiltered web data.
- The model does not perform well enough to judge the educational value of instruction datasets.

# 📈Analysis
## 🤖Model training with and without classifier
The expectation is that the model trained with the filter will outperform the model trained without it.
FineWeb is filtered on the fly with Educational Value >= 1.

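A minimal sketch of this on-the-fly filtering is shown below: FineWeb is streamed and only records scoring at or above the threshold are kept. The helper `educational_value()` is a stand-in for the classifier described in this README, not its actual API.

```python
# Minimal sketch of on-the-fly filtering with the `datasets` streaming API.
from datasets import load_dataset


def educational_value(text: str) -> float:
    # Placeholder so the sketch runs end to end; replace with the classifier's
    # actual scoring call.
    return 1.0


# Stream FineWeb so nothing has to be downloaded up front.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Keep only records with Educational Value >= 1.
filtered = fineweb.filter(lambda row: educational_value(row["text"]) >= 1.0)

for row in filtered.take(3):  # peek at a few records that pass the filter
    print(row["url"], row["text"][:80])
```

Because the stream is filtered lazily, no filtered copy of FineWeb has to be written to disk before training.
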
Test 1:
- Model params: 192M
- Training tokens: 3.1B (6,000 global steps)

|Task | Training on FineWeb With Filtering | Training on FineWeb Without Filtering | Training with [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)|
|---|---|---|---|
|arc-easy | 37.37 | 34.97 | 37.45 |
|arc-challenge | 23.55 | 22.95 | 23.21 |
|Hellaswag | 28.02 | 27.92 | 27.78 |
|MMLU | 24.71 | 23.94 | 24.65 |
|TruthfulQA | 45.88 | 45.20 | 45.97 |
|Winogrande | 49.49 | 50.59 | 50.67 |

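For reference, scores on these tasks can be computed with EleutherAI's lm-evaluation-harness; the sketch below uses a placeholder checkpoint path and is illustrative rather than the exact setup behind the table.

```python
# Hypothetical evaluation sketch using EleutherAI's lm-evaluation-harness (v0.4+).
# The checkpoint path is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/192M-checkpoint",  # placeholder path
    tasks=["arc_easy", "arc_challenge", "hellaswag",
           "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, comparable to the table above
```
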
Reasoning and commonsense reasoning scores appear to be better with the classifier filter, aligning with the expectation. They are also close to those obtained with Cosmopedia.
MMLU is also better; however, it is close to random due to compute limitations (both training time and model size).
A larger model will be trained to further validate this claim.

(To be updated with a larger model soon)

## 🌐Domain Name Analysis
The expectation is that most of the educational value comes from the websites of universities, schools, research institutes, and organisations.
Since [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) contains the URL of each crawled website, the average educational value of each domain name can be calculated.
The first 10M records have been analysed. The full file is available [here](https://drive.google.com/file/d/1WnOEH7IwfLJba2CuY207JY6s5hcW1gZQ/view?usp=sharing).

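A rough sketch of the per-domain aggregation is shown below: the domain is taken from FineWeb's `url` field and the classifier scores are averaged per domain. `educational_value()` is again a stand-in for the classifier, and the exact pipeline behind the linked file may differ.

```python
# Rough sketch of per-domain averaging of educational value.
from collections import defaultdict
from urllib.parse import urlparse

from datasets import load_dataset


def educational_value(text: str) -> float:
    return 1.0  # placeholder for the classifier's actual scoring call


fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

score_sum: dict[str, float] = defaultdict(float)
n_records: dict[str, int] = defaultdict(int)

# The analysis above covers the first 10M records; a smaller number is used
# here to keep the sketch quick to run.
for row in fineweb.take(100_000):
    domain = urlparse(row["url"]).netloc  # e.g. "en.wikipedia.org"
    score_sum[domain] += educational_value(row["text"])
    n_records[domain] += 1

# Average educational value per domain, keeping domains with >= 100 records,
# then take the 100 highest-scoring domains (as in the chart below).
averages = {d: score_sum[d] / n_records[d] for d in score_sum if n_records[d] >= 100}
top_100 = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:100]
print(top_100[:5])
```
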
Below are the top 100 domain names with at least 100 records each.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/3QNYYVbFIqaAUh-574lED.png)