---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- ModernBERT
- fineweb
- filtering
- regression
metrics:
- precision
- recall
- accuracy
model-index:
- name: 8e-5_one_label
  results: []
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
language:
- en
---

One-off run using a [modified version](https://gist.github.com/bclavie/93d3b161d7fb41131bca41a50b6726c5) of the original Fineweb-Edu quality-filter regression training code, with the only change being the base model: ModernBERT-base in place of the original snowflake-arctic-embed-m (a model fine-tuned from BERT-base).
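The model swap can be sketched as below. The `load_regression_model` helper is hypothetical (the actual code is in the linked gist); the original checkpoint id is shown only for contrast with the new one:

```python
# Assumption: the swap amounts to changing the checkpoint name; everything
# else in the training script stays the same.
ORIGINAL_MODEL = "Snowflake/snowflake-arctic-embed-m"
NEW_MODEL = "answerdotai/ModernBERT-base"

def load_regression_model(model_name: str):
    """Load a tokenizer and a 1-output classification head for score regression."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=1,               # single regression output: the 0-5 quality score
        problem_type="regression",  # use MSE loss rather than cross-entropy
    )
    return tokenizer, model
```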

Without extensive tuning, the model trains considerably faster than BERT-base and gains **+5 Weighted F1**:

# Results

## ModernBERT-base-fineweb-edu-example

**Weighted F1: 0.76**

**Detailed:**

```
Validation Report:
              precision    recall  f1-score   support

           0       0.80      0.55      0.65      5694
           1       0.82      0.86      0.84     26512
           2       0.64      0.71      0.67     10322
           3       0.65      0.60      0.63      3407
           4       0.80      0.37      0.51       807
           5       0.00      0.00      0.00         1

    accuracy                           0.76     46743
   macro avg       0.62      0.51      0.55     46743
weighted avg       0.76      0.76      0.76     46743
```

## [Original Classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)

**Weighted F1: 0.71**

**Detailed:**

```
              precision    recall  f1-score   support

           0       0.75      0.49      0.59      5694
           1       0.78      0.84      0.81     26512
           2       0.57      0.61      0.59     10322
           3       0.56      0.50      0.53      3407
           4       0.58      0.35      0.44       807
           5       0.33      0.01      0.02       125

    accuracy                           0.71     46867
   macro avg       0.60      0.47      0.50     46867
weighted avg       0.71      0.71      0.71     46867
```

(For some reason, the currently available annotated dataset is otherwise identical but is missing 124 of the 125 examples rated 5. These are so rare that they have no real impact on the weighted metrics.)
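As an illustrative sketch (an assumption, not the exact evaluation code from the script), reports like the ones above can be produced by rounding and clamping the regression outputs to integer labels in 0-5, then calling scikit-learn's `classification_report`:

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

def scores_to_labels(raw_scores):
    """Round continuous regression outputs to integer labels in [0, 5]."""
    return np.clip(np.round(raw_scores), 0, 5).astype(int)

# Toy predictions and references, purely for illustration.
preds = scores_to_labels(np.array([0.2, 1.4, 2.7, 3.1, 4.8, 5.3]))
refs = np.array([0, 1, 2, 3, 5, 4])

print(classification_report(refs, preds, zero_division=0))
print("Weighted F1:", round(f1_score(refs, preds, average="weighted"), 2))
```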

# Params

Most parameters are detailed in the script. Key hyperparameters:

- **Learning Rate**: 5e-5
- **Weight Decay**: 0.1 (decoupled)
- **Seed**: 1
- **Warmup**: 10% steps
- **Schedule**: Linear decay
- **Max epochs**: 10
- **Best Epoch**: #3
- **Precision**: bfloat16
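Mapped onto Hugging Face `TrainingArguments` field names (an illustrative assumption; the authoritative values live in the linked script), the list above looks like:

```python
# Key hyperparameters from the list above, expressed as HF TrainingArguments
# fields. This is a sketch for reference, not the script's literal config.
key_hparams = dict(
    learning_rate=5e-5,
    weight_decay=0.1,            # decoupled (AdamW-style) weight decay
    seed=1,
    warmup_ratio=0.10,           # warmup over 10% of training steps
    lr_scheduler_type="linear",  # linear decay after warmup
    num_train_epochs=10,         # best checkpoint came from epoch 3
    bf16=True,                   # bfloat16 precision
)
```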