---
license: mit
---

NB Linguistic Quality Regressor

Introduction

This model rates Norwegian training corpora on linguistic quality. It predicts a continuous score (a float from 0 to 5) that assesses the linguistic quality of a Norwegian text. The model is inspired by the classifiers used in the FineWeb project and is trained mainly on Norwegian content.

Model Architecture

The model is trained on top of the nb-bert-base model and uses code from Cosmopedia.

Training Data

The dataset used for training is derived from GlotCC and has been annotated using Gemini 1.5 Flash.

Purpose

The performance of large language models (LLMs) heavily depends on the quality and size of their pretraining datasets. This regressor aims to assess and enhance the linguistic quality of Norwegian textual data, contributing to better-performing Norwegian LLMs.

This model is part of a pair; the other is the NB Education Quality Regressor, which focuses on educational content.

Using the Model

For convenience, we also provide the run_regressor_bert.py script, which is based on run_edu_bert.py from Cosmopedia. You can modify this script to annotate HuggingFace datasets directly. Cosmopedia also provides Slurm scripts; we have not included these since we have not had the opportunity to test them.
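
The checkpoint can also be queried directly with the transformers library. The snippet below is a minimal sketch, assuming the checkpoint uses the standard single-output regression head from the Cosmopedia training code; the model id shown is a placeholder, so substitute the actual repository name or a local checkpoint path.

 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch

 # Placeholder model id: substitute the actual repository name or a local checkpoint path.
 model_name = "NbAiLab/nb-linguistic-quality-regressor"

 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSequenceClassification.from_pretrained(model_name)
 model.eval()

 text = "Dette er et eksempel på en norsk tekst."
 inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="longest")
 with torch.no_grad():
     logits = model(**inputs).logits

 # Single regression head: one logit per text, clipped to the 0-5 annotation range.
 score = max(0.0, min(5.0, logits.squeeze(-1).item()))
 print(f"Linguistic quality score: {score:.2f}")

Rounding the score to the nearest integer gives the discrete class used in the evaluation tables below.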

Training and Evaluation Procedure

The following command was used for training. Please note that train_regressor_bert.py contains a few minor changes relative to the original train_edu_bert.py:

 python train_regressor_bert.py --base_model_name="NbAiLab/nb-bert-base" --dataset_name="user/linguistic-annotations" --target_column="score" --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/"

The following command was used for evaluation.

 python eval_regressor_bert.py --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/final/" --dataset_name="user/linguistic-annotations"

Classification Report

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.84      | 0.60   | 0.70     | 12209   |
| 1     | 0.70      | 0.72   | 0.71     | 24316   |
| 2     | 0.41      | 0.49   | 0.44     | 10499   |
| 3     | 0.38      | 0.51   | 0.43     | 5833    |
| 4     | 0.10      | 0.24   | 0.14     | 1342    |
| 5     | 0.87      | 0.39   | 0.54     | 5656    |

Overall Metrics

| Metric                 | Value |
|------------------------|-------|
| Accuracy               | 0.59  |
| Macro avg precision    | 0.55  |
| Macro avg recall       | 0.49  |
| Macro avg F1-score     | 0.50  |
| Weighted avg precision | 0.65  |
| Weighted avg recall    | 0.59  |
| Weighted avg F1-score  | 0.61  |
| Support                | 59855 |

Confusion Matrix

|          | Predicted 0 | Predicted 1 | Predicted 2 | Predicted 3 | Predicted 4 | Predicted 5 |
|----------|-------------|-------------|-------------|-------------|-------------|-------------|
| Actual 0 | 7318        | 4278        | 529         | 63          | 19          | 2           |
| Actual 1 | 1364        | 17602       | 4414        | 785         | 135         | 16          |
| Actual 2 | 38          | 2615        | 5130        | 2289        | 369         | 58          |
| Actual 3 | 10          | 333         | 1726        | 2952        | 664         | 148         |
| Actual 4 | 3           | 83          | 350         | 476         | 324         | 106         |
| Actual 5 | 6           | 98          | 479         | 1205        | 1639        | 2229        |
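
The discrete report and confusion matrix above can be reproduced from the regressor's continuous outputs by rounding and clipping the scores to integer classes before scoring. Below is a minimal sketch using scikit-learn; the rounding-and-clipping convention is an assumption based on the FineWeb-Edu evaluation approach, and the arrays hold dummy values for illustration only.

 import numpy as np
 from sklearn.metrics import classification_report, confusion_matrix

 # Dummy continuous model outputs and gold annotation scores, for illustration only.
 preds = np.array([0.2, 1.4, 2.7, 4.9, 3.1])
 labels = np.array([0.0, 1.0, 3.0, 5.0, 3.0])

 # Round and clip both to the six integer classes 0-5 before scoring.
 pred_classes = np.clip(np.round(preds), 0, 5).astype(int)
 gold_classes = np.clip(np.round(labels), 0, 5).astype(int)

 print(classification_report(gold_classes, pred_classes, zero_division=0))
 print(confusion_matrix(gold_classes, pred_classes))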

Evaluation Metrics

| Metric                  | Value               |
|-------------------------|---------------------|
| Eval loss               | 0.673861563205719   |
| Eval precision          | 0.5502142676492386  |
| Eval recall             | 0.49225148166352145 |
| Eval F1 macro           | 0.49616318856882935 |
| Eval accuracy           | 0.5940188789574806  |
| Eval runtime (s)        | 285.9726            |
| Eval samples per second | 209.303             |
| Eval steps per second   | 3.273               |
| Epoch                   | 19.96               |

Training Runtime

| Metric                   | Value              |
|--------------------------|--------------------|
| Train runtime (s)        | 105056.8322        |
| Train samples per second | 102.552            |
| Train steps per second   | 1.603              |
| Train loss               | 0.6785072675819606 |
| Epoch                    | 20.0               |

Run Summary

| Metric                   | Value                  |
|--------------------------|------------------------|
| Eval accuracy            | 0.59402                |
| Eval F1 macro            | 0.49616                |
| Eval loss                | 0.67386                |
| Eval precision           | 0.55021                |
| Eval recall              | 0.49225                |
| Eval runtime (s)         | 285.9726               |
| Eval samples per second  | 209.303                |
| Eval steps per second    | 3.273                  |
| Total FLOPs              | 2.8346790572921083e+18 |
| Train epoch              | 20.0                   |
| Train global step        | 168360                 |
| Train grad norm          | 2.77268                |
| Train learning rate      | 0.0                    |
| Train loss               | 0.6201                 |
| Train loss (final)       | 0.67851                |
| Train runtime (s)        | 105056.8322            |
| Train samples per second | 102.552                |
| Train steps per second   | 1.603                  |

Citing & Authors

The model was trained and the documentation written by Per Egil Kummervold.