---
license: odc-by
language:
- en
library_name: fasttext
pipeline_tag: text-classification
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
---
# FineWeb-Edu FastText classifier

## Model summary
This is a FastText classifier for judging the educational value of web pages, trained on [fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).
There are two objectives:
- ⚡ Throughput optimisation: the model classifies more than 2,000 examples per second on CPU, so it can score data on the fly during pretraining or process huge corpora without a GPU (see the filtering sketch after the Usage section).
- 🧪 FastText vs transformer-based model: how does this lightweight, limited-capacity model compare with the original [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)?


## 🛠️ Usage
```python
from typing import List
import re

import fasttext
from huggingface_hub import hf_hub_download

# Download the classifier from the Hugging Face Hub and load it with fastText.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText expects one example per line, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    # Predicting on a list returns a (labels, probabilities) pair for the whole batch.
    pred = model_hf.predict(text_list)
    return [{"label": int(l[0].lstrip("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
```

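Because the classifier runs quickly on CPU, one possible use is as an on-the-fly quality filter. The sketch below is illustrative rather than part of the released model: it reuses the `predict` helper above, and the `min_label` threshold, batch size, and example corpus are assumptions you would tune for your own pipeline.

```python
from typing import Iterable, Iterator, List


def filter_educational(docs: Iterable[str], min_label: int = 3, batch_size: int = 1024) -> Iterator[str]:
    # Yield only documents whose predicted educational label is >= min_label.
    # min_label and batch_size are illustrative defaults, not recommendations
    # from the model card; predict() is the helper defined in the Usage section.
    def flush(batch: List[str]) -> Iterator[str]:
        for doc, result in zip(batch, predict(batch)):
            if result["label"] >= min_label:
                yield doc

    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield from flush(batch)
            batch = []
    if batch:  # score the final partial batch
        yield from flush(batch)


# Example: keep only pages predicted to have at least some educational value.
corpus = [
    "A gentle introduction to photosynthesis for middle-school students.",
    "Buy cheap watches now!!! Limited time offer!!!",
]
kept = list(filter_educational(corpus, min_label=2))
```
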
## 📊 Evaluation
The last 46,867 samples of the annotation dataset are held out as test data. Note that this is not exactly the same test split as the one used for [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
### Classification Report
```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
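A report of this shape (and the confusion matrix shown further below) can be produced with scikit-learn. The sketch below is a rough outline rather than the exact evaluation script: it reuses the `predict` helper from the Usage section, and the split name `train` and the column names `text` and `score` are assumptions about the dataset schema.

```python
from datasets import load_dataset
from sklearn.metrics import classification_report, confusion_matrix

# Load the annotations and hold out the last 46,867 rows as the test set.
# The split name "train" and the columns "text"/"score" are assumptions.
ds = load_dataset("HuggingFaceFW/fineweb-edu-llama3-annotations", split="train")
test = ds.select(range(len(ds) - 46867, len(ds)))

y_true = [int(row["score"]) for row in test]
y_pred = [p["label"] for p in predict([row["text"] for row in test])]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```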

The table below compares the per-label F1 score of this FastText model with the transformer-based original.

Label|This Model| [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
-----|-----|----
0|0.55 | 0.59
1|0.80 | 0.81
2|0.50 | 0.59
3|0.39 | 0.53
4|0.06 | 0.44
5|0.00 | 0.02

On labels 0, 1 and 2, this model is comparable to the original.
Performance degradation becomes noticeable at label 3 and widens further at label 4, reflecting the limited capacity of the FastText model.
So this classifier performs reasonably well on labels 0, 1 and 2, and on label 3 with some degradation.

### Confusion Matrix
```
        [ 2537  3098    65     4     0     0]
        [  944 23037  2491   123     0     0]
y_true  [   26  4742  5048   533     1     0]
        [    4   434  1846  1105     8     0]
        [    0    38   213   544    24     0]
        [    0     0     0     0     2     0]
                       y_pred
```

The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it. This conservatism is desirable when filtering large amounts of data.

Predicted - Actual Rating | Frequency | %
-----|-----|----
0 | 31751 | 67.7%
-1 | 8078 | 17.2%
+1 | 6130 | 13.1%
-2 | 673 | 1.4%
+2 | 189 | 0.4%
-3 | 42 | 0.1%
+3 | 4 | 0.0%

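The distribution above is simply the frequency of each predicted-minus-actual difference. Assuming the `y_true` and `y_pred` lists from the evaluation sketch earlier are available, it could be tabulated along these lines (a sketch, not the exact script used):

```python
from collections import Counter

# Count predicted-minus-actual rating differences (assumes y_true and y_pred
# from the evaluation sketch above).
diff_counts = Counter(p - t for p, t in zip(y_pred, y_true))
total = sum(diff_counts.values())
for diff, count in sorted(diff_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{diff:+d} | {count} | {count / total:.1%}")
```
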
### Alignment with [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
The Spearman rank-order correlation coefficient with the original classifier is 0.5832 on the MiniPile test split.
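
An alignment number of this kind could be computed with SciPy. The sketch below is illustrative: `fasttext_labels` and `transformer_labels` are hypothetical variables standing in for the two classifiers' integer predictions on the same MiniPile test documents, and the sample values are made up.

```python
from scipy.stats import spearmanr

# Hypothetical predictions from the two classifiers on the same documents.
fasttext_labels = [1, 2, 0, 3, 1]
transformer_labels = [1, 3, 0, 2, 1]

corr, p_value = spearmanr(fasttext_labels, transformer_labels)
print(f"Spearman rank-order correlation: {corr:.4f} (p={p_value:.3g})")
```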