---
license: odc-by
language:
- en
library_name: fasttext
pipeline_tag: text-classification
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
---
# FineWeb-Edu FastText classifier

## Model summary
This is a FastText classifier for judging the educational value of web pages, trained on [fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).
There are two objectives:
- ⚡ Throughput optimisation: the model classifies more than 2,000 examples per second on CPU, so it can score data on the fly during pretraining or process huge corpora without a GPU (see the filtering sketch after the Usage section).
- 🧪 FastText vs transformer-based model: how does this lightweight, limited-capacity model compare with the original [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)?


## 🛠️ Usage
```python
from typing import List
import re

import fasttext
from huggingface_hub import hf_hub_download

# Download the classifier from the Hugging Face Hub and load it with fastText.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText expects one example per line, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    # Predicting on a list returns a (labels, probabilities) pair for the whole batch.
    pred = model_hf.predict(text_list)
    return [{"label": int(l[0].lstrip("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
```

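Because the classifier runs quickly on CPU, one possible use is as an on-the-fly quality filter. The sketch below is illustrative rather than part of the released model: it reuses the `predict` helper above, and the `min_label` threshold, batch size, and example corpus are assumptions you would tune for your own pipeline.

```python
from typing import Iterable, Iterator, List


def filter_educational(docs: Iterable[str], min_label: int = 3, batch_size: int = 1024) -> Iterator[str]:
    # Yield only documents whose predicted educational label is >= min_label.
    # min_label and batch_size are illustrative defaults, not recommendations
    # from the model card; predict() is the helper defined in the Usage section.
    def flush(batch: List[str]) -> Iterator[str]:
        for doc, result in zip(batch, predict(batch)):
            if result["label"] >= min_label:
                yield doc

    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield from flush(batch)
            batch = []
    if batch:  # score the final partial batch
        yield from flush(batch)


# Example: keep only pages predicted to have at least some educational value.
corpus = [
    "A gentle introduction to photosynthesis for middle-school students.",
    "Buy cheap watches now!!! Limited time offer!!!",
]
kept = list(filter_educational(corpus, min_label=2))
```
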
## 📊 Evaluation
The last 46,867 samples of the annotation dataset are held out as test data. Note that this is not exactly the same test split as the one used for [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
### Classification Report
```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
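A report of this shape (and the confusion matrix shown further below) can be produced with scikit-learn. The sketch below is a rough outline rather than the exact evaluation script: it reuses the `predict` helper from the Usage section, and the split name `train` and the column names `text` and `score` are assumptions about the dataset schema.

```python
from datasets import load_dataset
from sklearn.metrics import classification_report, confusion_matrix

# Load the annotations and hold out the last 46,867 rows as the test set.
# The split name "train" and the columns "text"/"score" are assumptions.
ds = load_dataset("HuggingFaceFW/fineweb-edu-llama3-annotations", split="train")
test = ds.select(range(len(ds) - 46867, len(ds)))

y_true = [int(row["score"]) for row in test]
y_pred = [p["label"] for p in predict([row["text"] for row in test])]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```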

The table below compares the per-label F1 score of this FastText model with the transformer-based original.

Label|This Model| [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
-----|-----|----
0|0.55 | 0.59
1|0.80 | 0.81
2|0.50 | 0.59
3|0.39 | 0.53
4|0.06 | 0.44
5|0.00 | 0.02

On labels 0, 1 and 2, this model is comparable to the original.
Performance degradation becomes noticeable at label 3 and widens further at label 4, reflecting the limited capacity of the FastText model.
So this classifier performs reasonably well on labels 0, 1 and 2, and on label 3 with some degradation.

### Confusion Matrix
```
        [ 2537  3098    65     4     0     0]
        [  944 23037  2491   123     0     0]
y_true  [   26  4742  5048   533     1     0]
        [    4   434  1846  1105     8     0]
        [    0    38   213   544    24     0]
        [    0     0     0     0     2     0]
                       y_pred
```

The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it. This conservatism is desirable when filtering large amounts of data.

Predicted - Actual Rating | Frequency | %
-----|-----|----
0 | 31751 | 67.7%
-1 | 8078 | 17.2%
+1 | 6130 | 13.1%
-2 | 673 | 1.4%
+2 | 189 | 0.4%
-3 | 42 | 0.1%
+3 | 4 | 0.0%

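The distribution above is simply the frequency of each predicted-minus-actual difference. Assuming the `y_true` and `y_pred` lists from the evaluation sketch earlier are available, it could be tabulated along these lines (a sketch, not the exact script used):

```python
from collections import Counter

# Count predicted-minus-actual rating differences (assumes y_true and y_pred
# from the evaluation sketch above).
diff_counts = Counter(p - t for p, t in zip(y_pred, y_true))
total = sum(diff_counts.values())
for diff, count in sorted(diff_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{diff:+d} | {count} | {count / total:.1%}")
```
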
### Alignment with [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
The Spearman rank-order correlation coefficient with the original classifier is 0.5832 on the MiniPile test split.
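
An alignment number of this kind could be computed with SciPy. The sketch below is illustrative: `fasttext_labels` and `transformer_labels` are hypothetical variables standing in for the two classifiers' integer predictions on the same MiniPile test documents, and the sample values are made up.

```python
from scipy.stats import spearmanr

# Hypothetical predictions from the two classifiers on the same documents.
fasttext_labels = [1, 2, 0, 3, 1]
transformer_labels = [1, 3, 0, 2, 1]

corr, p_value = spearmanr(fasttext_labels, transformer_labels)
print(f"Spearman rank-order correlation: {corr:.4f} (p={p_value:.3g})")
```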