---
license: odc-by
language:
- en
library_name: fasttext
pipeline_tag: text-classification
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
---
# FineWeb-Edu FastText classifier

## Model summary
This is a FastText classifier for judging the educational value of web pages, trained on [fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).
There are two objectives:
- ⚡ Throughput optimisation: it can classify more than 2,000 examples per second on CPU, so it can be used on the fly during pretraining or to process huge amounts of data with CPU only.
- 🧪 fastText vs transformer-based model: how does this lightweight model with limited capacity compare to the original model [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)?

## 🛠️ Usage
```python
from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


# Download the classifier from the Hub and load it with fastText.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText expects single-line inputs, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Labels come back as "__label__<digit>"; strip the prefix and cast to int.
    return [{"label": int(l[0].lstrip("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
```

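The throughput figure above (more than 2,000 examples per second on CPU) can be sanity-checked with a rough benchmark like the sketch below; the sample text, batch size and timing loop are illustrative assumptions, not the setup behind the original measurement.

```python
import time

# Rough, hypothetical benchmark reusing predict() from the snippet above.
# Throughput will vary with CPU, text length and batch size.
docs = ["This article explains photosynthesis to middle school students."] * 10_000

start = time.perf_counter()
predict(docs)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:,.0f} examples/second")
```
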
## 📊 Evaluation
The last 46,867 samples of the annotation dataset are used as test data; note that this is not the exact test split used in [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
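
A minimal sketch of how such an evaluation could be reproduced with the `predict` helper from the Usage section; the column names `text` and `score` are assumptions, so check the dataset card for the exact schema.

```python
from datasets import load_dataset
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical reproduction sketch; column names are assumed, not verified.
ds = load_dataset("HuggingFaceFW/fineweb-edu-llama3-annotations", split="train")
test = ds.select(range(len(ds) - 46867, len(ds)))  # last 46,867 rows as test data

y_true = test["score"]
y_pred = [p["label"] for p in predict(test["text"])]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```
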
### Classification Report
```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```

The table below compares the per-label F1 score of this FastText model with the transformer-based model.

Label|This Model| [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
-----|-----|----
0|0.55 | 0.59
1|0.80 | 0.81
2|0.50 | 0.59
3|0.39 | 0.53
4|0.06 | 0.44
5|0.00 | 0.02

Labels 0, 1 and 2 are comparable to the original model.
Performance degradation starts to be noticeable at label 3 and widens further at label 4, due to the limited capacity of the fastText model.
So this classifier performs reasonably well on labels 0, 1 and 2, and also on label 3 with some degradation.

### Confusion Matrix
```
         [[ 2537  3098    65     4     0     0]
          [  944 23037  2491   123     0     0]
y_true    [   26  4742  5048   533     1     0]
          [    4   434  1846  1105     8     0]
          [    0    38   213   544    24     0]
          [    0     0     0     0     2     0]]
                        y_pred
```

The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it. This conservatism is desirable when filtering large amounts of data.

Predicted - Actual Rating | Frequency | %
-----|-----|----
0|31751 | 67.7%
-1|8078 | 17.2%
+1| 6130 | 13.1%
-2|673 | 1.4%
+2|189 | 0.4%
-3|42 | 0.1%
+3|4 | 0.0%

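The difference distribution above can be derived from the same `y_true` and `y_pred` lists as in the earlier evaluation sketch; this is an illustrative snippet, not the original script.

```python
from collections import Counter

# Tally (predicted - actual) rating differences from the evaluation sketch above.
diffs = Counter(p - t for p, t in zip(y_pred, y_true))
total = sum(diffs.values())
for diff, count in diffs.most_common():
    print(f"{diff:+d} | {count} | {count / total:.1%}")
```
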
### Alignment with [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
The Spearman rank-order correlation coefficient between this model's predictions and the original classifier's scores is 0.5832 on the MiniPile test split.
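
A minimal sketch of how this alignment could be measured; the MiniPile dataset id (`JeanKaddour/minipile`), the 1,000-document subsample, and the transformer scoring snippet (adapted from that model's card) are assumptions, not the exact evaluation script.

```python
import torch
from datasets import load_dataset
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical alignment check; dataset id and subsample size are assumptions.
texts = load_dataset("JeanKaddour/minipile", split="test")["text"][:1000]

# Scores from this fastText model (predict() from the Usage section).
fasttext_scores = [p["label"] for p in predict(texts)]

# Scores from the transformer classifier, following its model card.
tok = AutoTokenizer.from_pretrained("HuggingFaceFW/fineweb-edu-classifier")
clf = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/fineweb-edu-classifier")
clf.eval()

transformer_scores = []
with torch.no_grad():
    for text in texts:
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
        transformer_scores.append(clf(**inputs).logits.squeeze(-1).item())

rho, _ = spearmanr(fasttext_scores, transformer_scores)
print(f"Spearman rank-order correlation: {rho:.4f}")
```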