---
language:
  - bg
metrics:
  - f1
  - accuracy
  - precision
  - recall
base_model:
  - rmihaylov/bert-base-bg
pipeline_tag: text-classification
license: apache-2.0
datasets:
  - sofia-uni/toxic-data-bg
  - wikimedia/wikipedia
  - oscar-corpus/oscar
  - petkopetkov/chitanka
tags:
  - bert
  - not-for-all-audiences
  - medical
---

toxic-bert-bg is a toxic-language classification model for Bulgarian, based on the rmihaylov/bert-base-bg model.

The model classifies text into four classes: Toxic, MedicalTerminology, NonToxic, MinorityGroup.

Classification report:

| Accuracy | Precision | Recall | F1 Score | Loss |
|----------|-----------|--------|----------|------|
| 0.85     | 0.86      | 0.85   | 0.85     | 0.43 |

More information can be found in the paper cited below.

Code and usage

For the training files and instructions on how to use the model, refer to the project's GitHub repository.
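Until you reach the repository, the checkpoint can be loaded like any Hugging Face text-classification model. A minimal sketch with the Transformers `pipeline` API follows; the Hub repository id in the default argument is an assumption, so substitute the actual one:

```python
def build_classifier(model_id="melaniab/toxic-bert-bg"):
    """Load the checkpoint as a four-class text-classification pipeline.

    The model id above is a placeholder; replace it with the model's
    actual Hugging Face Hub repository name.
    """
    from transformers import pipeline  # requires `pip install transformers`
    # top_k=None returns a score for each of the four classes:
    # Toxic, MedicalTerminology, NonToxic, MinorityGroup.
    return pipeline("text-classification", model=model_id, top_k=None)

# Usage (downloads the checkpoint on first call):
#   clf = build_classifier()
#   clf("Примерен текст на български.")  # "An example text in Bulgarian."
```

Passing `top_k=None` makes the pipeline return the probability of every class instead of only the top one, which is useful when you need to distinguish, say, MedicalTerminology from Toxic on borderline inputs.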

Reference

If you use this model in your academic project, please cite as:

```bibtex
@article{berbatova2025detecting,
  doi={10.13140/RG.2.2.34963.18723},
  title={Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text},
  author={Berbatova, Melania and Vasev, Tsvetoslav},
  year={2025}
}
```