	---
	license: mit
	---
	# NB Linguistic Quality Regressor

	## Introduction

	This model rates the linguistic quality of Norwegian texts considered for training corpora. It predicts a continuous score (a float from 0 to 5). The model is inspired by the classifiers used in the FineWeb project and is trained mainly on Norwegian content.

	## Model Architecture

	The model is fine-tuned from [nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base) and uses code from [Cosmopedia](https://github.com/huggingface/cosmopedia/tree/main/classification).

	## Training Data

	The dataset used for training is derived from [GlotCC](https://huggingface.co/datasets/cis-lmu/GlotCC-V1) and has been annotated using Gemini 1.5 Flash.

	## Purpose

	The performance of large language models (LLMs) heavily depends on the quality and size of their pretraining datasets. This regressor aims to assess and enhance the linguistic quality of Norwegian textual data, contributing to better-performing Norwegian LLMs.

	This model is part of a pair; the other is the [NB Education Quality Regressor](https://huggingface.co/NbAiLab/nb-education-quality-regressor), which focuses on educational content.


	## Using the Model
	For convenience, we also provide the `run_regressor_bert.py` script, which is based on `run_edu_bert.py` from Cosmopedia. You can modify this script to annotate HuggingFace datasets directly. Cosmopedia also provides Slurm scripts. We have not included these since we have not had the opportunity to test them.
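
For scoring a handful of texts without the script, a minimal sketch follows. It rests on assumptions this card does not confirm: the model ID `NbAiLab/nb-linguistic-quality-regressor` is a guess based on the name of the sibling model, and the checkpoint is assumed to load with the standard `transformers` sequence-classification classes and to emit a single regression logit, as the Cosmopedia classifier code does. The helper names are hypothetical.

```python
from typing import List


def score_to_label(raw_score: float) -> int:
    """Clamp a raw regression output to [0, 5] and round to the nearest integer class."""
    return int(round(max(0.0, min(raw_score, 5.0))))


def score_texts(
    texts: List[str],
    model_name: str = "NbAiLab/nb-linguistic-quality-regressor",  # assumed ID, not confirmed
) -> List[float]:
    """Return one raw quality score per input text."""
    # Imported lazily so score_to_label works without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # BERT-base accepts at most 512 tokens, so longer texts are truncated.
    inputs = tokenizer(
        texts, return_tensors="pt", padding="longest", truncation=True, max_length=512
    )
    with torch.no_grad():
        # Assumes a single output unit: logits have shape (batch, 1).
        logits = model(**inputs).logits.squeeze(-1)
    return [float(value) for value in logits]
```

Calling `score_texts(["Dette er en setning."])` would return a list with one float, and `score_to_label` maps it to an integer class comparable to the classes in the classification report.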


	## Training and Evaluation Procedure
	The following command was used for training. Note that `train_regressor_bert.py` has a few minor changes compared to the original `train_edu_bert.py`:
	```
	python train_regressor_bert.py --base_model_name="NbAiLab/nb-bert-base" --dataset_name="user/linguistic-annotations" --target_column="score" --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/"
	```

	The following command was used for evaluation:
	```
	python eval_regressor_bert.py --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/final/" --dataset_name="user/linguistic-annotations"
	```


	## Classification Report

	| Class | Precision | Recall | F1-score | Support |
	|-------|-----------|--------|----------|---------|
	| 0 | 0.84 | 0.60 | 0.70 | 12209 |
	| 1 | 0.70 | 0.72 | 0.71 | 24316 |
	| 2 | 0.41 | 0.49 | 0.44 | 10499 |
	| 3 | 0.38 | 0.51 | 0.43 | 5833 |
	| 4 | 0.10 | 0.24 | 0.14 | 1342 |
	| 5 | 0.87 | 0.39 | 0.54 | 5656 |

	### Overall Metrics

	| Metric | Precision | Recall | F1-score | Support |
	|--------------|-----------|--------|----------|---------|
	| Accuracy | | | 0.59 | 59855 |
	| Macro Avg | 0.55 | 0.49 | 0.50 | 59855 |
	| Weighted Avg | 0.65 | 0.59 | 0.61 | 59855 |


	## Confusion Matrix

	| | Predicted 0 | Predicted 1 | Predicted 2 | Predicted 3 | Predicted 4 | Predicted 5 |
	|----------|-------------|-------------|-------------|-------------|-------------|-------------|
	| Actual 0 | 7318 | 4278 | 529 | 63 | 19 | 2 |
	| Actual 1 | 1364 | 17602 | 4414 | 785 | 135 | 16 |
	| Actual 2 | 38 | 2615 | 5130 | 2289 | 369 | 58 |
	| Actual 3 | 10 | 333 | 1726 | 2952 | 664 | 148 |
	| Actual 4 | 3 | 83 | 350 | 476 | 324 | 106 |
	| Actual 5 | 6 | 98 | 479 | 1205 | 1639 | 2229 |
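
The per-class and overall figures in the classification report can be recomputed from the confusion matrix above. The sketch below uses only the counts from that table and the standard definitions of precision, recall and F1; the variable names are illustrative.

```python
# Rows are actual classes 0-5, columns are predicted classes 0-5,
# copied from the confusion matrix table.
confusion = [
    [7318, 4278, 529, 63, 19, 2],
    [1364, 17602, 4414, 785, 135, 16],
    [38, 2615, 5130, 2289, 369, 58],
    [10, 333, 1726, 2952, 664, 148],
    [3, 83, 350, 476, 324, 106],
    [6, 98, 479, 1205, 1639, 2229],
]

num_classes = len(confusion)
total = sum(sum(row) for row in confusion)

# Support: number of samples whose actual class is i (row sums).
support = [sum(row) for row in confusion]
# Number of samples predicted as class i (column sums).
predicted = [sum(row[c] for row in confusion) for c in range(num_classes)]

precision = [confusion[i][i] / predicted[i] for i in range(num_classes)]
recall = [confusion[i][i] / support[i] for i in range(num_classes)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

accuracy = sum(confusion[i][i] for i in range(num_classes)) / total
macro_precision = sum(precision) / num_classes
macro_recall = sum(recall) / num_classes
macro_f1 = sum(f1) / num_classes

for i in range(num_classes):
    print(f"class {i}: P={precision[i]:.2f} R={recall[i]:.2f} "
          f"F1={f1[i]:.2f} support={support[i]}")
print(f"accuracy={accuracy:.4f} macro P={macro_precision:.4f} "
      f"macro R={macro_recall:.4f} macro F1={macro_f1:.4f}")
```

The row sums reproduce the Support column, and the macro averages are unweighted means over the six classes.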

	## Evaluation Metrics

	| Metric | Value |
	|-------------------------|---------------------|
	| Eval Loss | 0.673861563205719 |
	| Eval Precision | 0.5502142676492386 |
	| Eval Recall | 0.49225148166352145 |
	| Eval F1 Macro | 0.49616318856882935 |
	| Eval Accuracy | 0.5940188789574806 |
	| Eval Runtime (s) | 285.9726 |
	| Eval Samples per Second | 209.303 |
	| Eval Steps per Second | 3.273 |
	| Epoch | 19.96 |


	## Training Runtime

	| Metric | Value |
	|--------------------------|--------------------|
	| Train Runtime (s) | 105056.8322 |
	| Train Samples per Second | 102.552 |
	| Train Steps per Second | 1.603 |
	| Train Loss | 0.6785072675819606 |
	| Epoch | 20.0 |

	### Run Summary

	| Metric | Value |
	|--------------------------|------------------------|
	| Eval Accuracy | 0.59402 |
	| Eval F1 Macro | 0.49616 |
	| Eval Loss | 0.67386 |
	| Eval Precision | 0.55021 |
	| Eval Recall | 0.49225 |
	| Eval Runtime (s) | 285.9726 |
	| Eval Samples per Second | 209.303 |
	| Eval Steps per Second | 3.273 |
	| Total FLOPs | 2.8346790572921083e+18 |
	| Train Epoch | 20.0 |
	| Train Global Step | 168360 |
	| Train Grad Norm | 2.77268 |
	| Train Learning Rate | 0.0 |
	| Train Loss | 0.6201 |
	| Train Loss (Final) | 0.67851 |
	| Train Runtime (s) | 105056.8322 |
	| Train Samples per Second | 102.552 |
	| Train Steps per Second | 1.603 |
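
A few quantities that the run summary does not state directly can be estimated from its throughput figures. The sketch below assumes the reported rates are averages over the whole run and that runtimes are in seconds; the derived batch size and sample counts are estimates, not values reported by the author.

```python
# Figures copied from the Run Summary table.
train_runtime_s = 105056.8322
train_samples_per_s = 102.552
train_steps_per_s = 1.603
train_epochs = 20.0
global_steps = 168360
eval_runtime_s = 285.9726
eval_samples_per_s = 209.303

# Wall-clock training time in hours.
train_hours = train_runtime_s / 3600

# Samples processed per optimizer step, i.e. the implied effective batch size.
batch_size = train_samples_per_s / train_steps_per_s

# Optimizer steps and (estimated) training samples per epoch.
steps_per_epoch = global_steps / train_epochs
train_samples_per_epoch = train_runtime_s * train_samples_per_s / train_epochs

# Implied evaluation set size; should agree with the Support total.
eval_samples = eval_runtime_s * eval_samples_per_s

print(f"training time: {train_hours:.1f} h")
print(f"implied batch size: {batch_size:.1f}")
print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"estimated training samples per epoch: {train_samples_per_epoch:.0f}")
print(f"estimated evaluation samples: {eval_samples:.0f}")
```

The estimated evaluation set size agrees with the total support of 59855 in the classification report, which is a useful consistency check on the logged numbers.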

	## Citing & Authors
	The model was trained, and the documentation was written, by Per Egil Kummervold.
