CCI3-HQ-Classifier

Model summary

This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets, and was trained on 145k annotations generated by Qwen2-72B-instruct for web samples from the CCI3 dataset.

We used this classifier to build the CCI3-HQ dataset.

How to use in transformers

To load the classifier, use the following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/cci3-hq-classifier")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/cci3-hq-classifier")

text = "曾巩:为人廉洁奉公,才华横溢,关心民间疾苦曾巩,字子固,是我国北宋时期著名的文学家,政治家和教育家。他的一生政绩颇丰,为百姓们做出了许多的好事,在文学创作上他又是北宋诗文革新的主要人物。他文章写得耐人寻味,表露了自己的真情实感。被后人称之为 唐宋八大家之一 。"
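# Tokenize a single document; text beyond the model's maximum length is truncated.
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
# The head has a single regression output, so logits has shape (batch_size, 1).
logits = outputs.logits.squeeze(-1).float().detach().numpy()
# Only one text was passed in, so the array holds exactly one score.
score = logits.item()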
result = {
    "text": text,
    "score": score
}

print(result)

Training

The classifier was trained on 145,000 pairs of web samples and their scores from 0 to 5, generated by Qwen2. Each sample was annotated for educational quality, with 0 meaning not educational and 5 meaning highly educational.

The annotation prompt largely reuses the FineWeb-edu prompt.

We added a classification head with a single regression output to BGE-M3 and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3.
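The setup described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released training code: a tiny stand-in encoder replaces BGE-M3 so the example runs without downloading weights, and the AdamW optimizer, the mean-squared-error loss, the toy data, and the names `RegressionHeadModel` and `toy_encoder` are all assumptions that the description above does not specify.

```python
import torch
from torch import nn


class RegressionHeadModel(nn.Module):
    """Frozen encoder plus a single-output regression head (hypothetical name)."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        # Freeze every encoder parameter so that only the head is updated.
        for param in self.encoder.parameters():
            param.requires_grad = False
        # Single regression output, no dropout.
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # The encoder is frozen, so no gradient tracking is needed for it.
        with torch.no_grad():
            pooled = self.encoder(features)
        return self.head(pooled).squeeze(-1)


torch.manual_seed(0)
hidden_size = 16
# Stand-in for BGE-M3: maps 8-dim toy features to a 16-dim "embedding".
toy_encoder = nn.Sequential(nn.Linear(8, hidden_size), nn.Tanh())
model = RegressionHeadModel(toy_encoder, hidden_size)

# Only the head's parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4)  # optimizer choice is an assumption
loss_fn = nn.MSELoss()  # loss choice is an assumption

features = torch.randn(4, 8)                      # toy inputs
targets = torch.tensor([0.0, 2.0, 3.0, 5.0])      # toy scores in the 0-5 range

encoder_before = [p.clone() for p in toy_encoder.parameters()]
for _ in range(20):  # 20 passes over the toy batch, mirroring the 20 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()
```

Freezing the backbone means gradients flow only into the small linear head, which keeps training cheap relative to fine-tuning the full encoder.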

Training Details:

Model: BGE-M3 with a classification head
Dataset: 145,000 samples from Qwen2 annotations
Epochs: 20
Learning Rate: 3e-4
Evaluation Metric: F1 score
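
To make the evaluation metric concrete, here is a minimal sketch of turning 0-5 scores into binary labels with a threshold of 3 and computing F1 on them. The function names and the toy scores are made up for illustration, and applying the threshold to the raw (unrounded) predicted score is an assumption; the released evaluation code may differ.

```python
def binarize(scores, threshold=3.0):
    """Map 0-5 scores to booleans: True means the score reaches the threshold."""
    return [s >= threshold for s in scores]


def binary_f1(y_true, y_pred):
    """F1 of the positive class from two equal-length lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the actual evaluation set.
toy_annotated = [0, 1, 3, 4, 5, 2]
toy_predicted = [0.4, 1.2, 2.6, 3.9, 3.1, 3.2]

f1 = binary_f1(binarize(toy_annotated), binarize(toy_predicted))
print(f"binary F1 at threshold 3: {f1:.3f}")
```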

Classification report

We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 14,080 Qwen2-annotated samples (the total support in the report below).

              precision    recall  f1-score   support

           0       0.76      0.58      0.66      3890  
           1       0.55      0.62      0.58      4896  
           2       0.40      0.51      0.45      2703  
           3       0.38      0.42      0.40      1536  
           4       0.59      0.27      0.37       972
           5       0.33      0.06      0.10        83

    accuracy                           0.54     14080 
   macro avg       0.50      0.41      0.43     14080 
weighted avg       0.56      0.54      0.54     14080

Confusion matrix

The confusion matrix confirms that the predicted educational scores are close to the ground truth: most errors fall in a class adjacent to the true one, which we attribute mostly to noisy annotation.

                          y_pred
               0     1     2     3     4     5

          0  2244  1514   126     6     0     0
          1   690  3035  1049   117     5     0
y_true    2    24   878  1383   398    20     0
          3     0   118   651   643   124     0
          4     1    13   202   482   264    10
          5     0     0     6    39    33     5
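
As a cross-check, the figures in the classification report can be recomputed from the confusion matrix. The sketch below uses only the counts shown above (rows are true classes 0-5, columns are predicted classes 0-5); the variable names are illustrative.

```python
# Rows: true class 0-5. Columns: predicted class 0-5.
confusion = [
    [2244, 1514, 126, 6, 0, 0],
    [690, 3035, 1049, 117, 5, 0],
    [24, 878, 1383, 398, 20, 0],
    [0, 118, 651, 643, 124, 0],
    [1, 13, 202, 482, 264, 10],
    [0, 0, 6, 39, 33, 5],
]
n = len(confusion)

support = [sum(row) for row in confusion]                    # samples per true class
predicted = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
correct = [confusion[i][i] for i in range(n)]

total = sum(support)
accuracy = sum(correct) / total
precision = [correct[i] / predicted[i] for i in range(n)]
recall = [correct[i] / support[i] for i in range(n)]

# Share of misclassified samples that land in a neighbouring class.
errors = total - sum(correct)
adjacent = sum(
    confusion[i][j] for i in range(n) for j in range(n) if abs(i - j) == 1
)
adjacent_share = adjacent / errors

print(f"samples={total} accuracy={accuracy:.2f}")
for i in range(n):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
print(f"errors off by exactly one class: {adjacent_share:.0%}")
```

Roughly nine in ten errors are off by exactly one class, which is consistent with the predictions staying close to the annotated scores.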

Limitations

While the CCI3-HQ classifier performs well at distinguishing high-quality educational content in the CCI3 dataset, it has some limitations:

Scope: The model's performance may vary across different datasets, particularly when applied to out-of-distribution samples. It is specifically designed to handle educational content related to primary and grade school levels and may exhibit lower performance on content intended for higher education or specialized domains.
Bias: The model's performance relies on the quality and representativeness of both the training data and the LLM used for annotation. Biases in either can influence the classifier's decisions. The classifier may also overfit to content that merely appears academic and assign it higher scores. We recommend using int_score >= 3 as the threshold for data curation.
Context: The classifier operates by evaluating individual web pages or extracts without considering the broader context, which may limit its effectiveness in certain scenarios.
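
The int_score threshold mentioned above is not defined in this card. A minimal sketch, assuming it follows the FineWeb-edu convention of clamping the regression score to the 0-5 range and rounding to the nearest integer, could look like this. The helper names `to_int_score` and `keep_for_curation` are hypothetical.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round it to an integer.

    Assumption: this mirrors the FineWeb-edu convention. Note that Python's
    round() rounds exact halves to the nearest even integer (2.5 -> 2).
    """
    return int(round(max(0.0, min(score, 5.0))))


def keep_for_curation(score: float, threshold: int = 3) -> bool:
    """Return True when the integer score reaches the curation threshold."""
    return to_int_score(score) >= threshold


# Toy raw scores, not real model outputs.
for raw in (-0.3, 2.4, 2.6, 5.7):
    print(raw, to_int_score(raw), keep_for_curation(raw))
```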

The training and inference code is available on GitHub: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/CCI3-HQ