metadata
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
XLM-R-BERTić
This model was produced by pre-training XLM-Roberta-large for 48k steps on South Slavic languages using the XLM-R-BERTić dataset.
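A minimal sketch of how a checkpoint like this is typically loaded for masked-token prediction with the Hugging Face `transformers` library. The identifier `classla/xlm-r-bertic` is an assumption inferred from the dataset name and should be checked against the actual repository. The helper names `load_fill_mask` and `top_predictions` are hypothetical.

```python
# Assumed Hub identifier; verify it against the actual model repository.
MODEL_ID = "classla/xlm-r-bertic"


def load_fill_mask(model_id: str = MODEL_ID):
    """Return a fill-mask pipeline for the given checkpoint.

    transformers is imported lazily, so defining this function does not
    require the library; calling it downloads the model weights.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text: str, k: int = 3):
    """Return (token, score) pairs for the k most likely fillers of the
    single mask token in `text`."""
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]
```

XLM-R tokenizers use `<mask>` as the mask token, so a call could look like `top_predictions(load_fill_mask(), "Zagreb je glavni grad <mask>.")`. For the benchmarks reported here, the model was fine-tuned per task rather than used in this zero-shot fashion.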
Benchmarking
Three tasks were chosen for model evaluation:
- Named Entity Recognition (NER)
- Sentiment regression
- COPA (Choice of Plausible Alternatives)
In all cases, this model was fine-tuned for the specific downstream task.
NER
Average macro-F1 scores from three runs were used to evaluate performance. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | hr500k | 0.927 |
BERTić | hr500k | 0.925 |
XLM-R-SloBERTić | hr500k | 0.923 |
XLM-Roberta-Large | hr500k | 0.919 |
crosloengual-bert | hr500k | 0.918 |
XLM-Roberta-Base | hr500k | 0.903 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | ReLDI-hr | 0.812 |
XLM-R-BERTić | ReLDI-hr | 0.809 |
crosloengual-bert | ReLDI-hr | 0.794 |
BERTić | ReLDI-hr | 0.792 |
XLM-Roberta-Large | ReLDI-hr | 0.791 |
XLM-Roberta-Base | ReLDI-hr | 0.763 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | SETimes.SR | 0.949 |
XLM-R-BERTić | SETimes.SR | 0.940 |
BERTić | SETimes.SR | 0.936 |
XLM-Roberta-Large | SETimes.SR | 0.933 |
crosloengual-bert | SETimes.SR | 0.922 |
XLM-Roberta-Base | SETimes.SR | 0.914 |
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | ReLDI-sr | 0.841 |
XLM-R-SloBERTić | ReLDI-sr | 0.824 |
BERTić | ReLDI-sr | 0.798 |
XLM-Roberta-Large | ReLDI-sr | 0.774 |
crosloengual-bert | ReLDI-sr | 0.751 |
XLM-Roberta-Base | ReLDI-sr | 0.734 |
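The scoring described above (macro-F1, averaged over three runs) can be sketched in plain Python. This version computes F1 per label at the token level and takes the unweighted mean, which is an assumption: the benchmark may define the label set or the granularity differently. The toy labels and predictions are illustrative only, not taken from the datasets.

```python
from statistics import mean


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def average_over_runs(gold: list[str], runs: list[list[str]]) -> float:
    """Mean macro-F1 across several fine-tuning runs on the same test set."""
    return mean(macro_f1(gold, pred) for pred in runs)


# Toy example: one gold sequence and the predictions of three runs.
gold = ["B-PER", "O", "B-LOC", "O"]
runs = [
    ["B-PER", "O", "B-LOC", "O"],  # all tokens correct
    ["B-PER", "O", "O", "O"],      # misses the location
    ["O", "O", "B-LOC", "O"],      # misses the person
]
avg_f1 = average_over_runs(gold, runs)
```

Because every label counts equally, a rare entity class that is missed entirely pulls the score down as much as a frequent one would.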
Sentiment regression
The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. Performance is reported as the coefficient of determination (r^2) on the test split. The procedure is explained in greater detail in the dedicated benchmarking repository.
system | train | test | r^2 |
---|---|---|---|
xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
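The negative score of the dummy baseline in the table above follows from the standard definition of the coefficient of determination: a constant prediction taken from the training mean scores below zero whenever that mean differs from the test mean. A minimal sketch with made-up numbers; whether the benchmarking repository computes the metric exactly this way is an assumption.

```python
from statistics import mean


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative sentiment scores (not from ParlaSent).
train_scores = [1.0, 2.0, 3.0, 4.0]  # mean 2.5
test_scores = [2.0, 3.0, 4.0, 5.0]   # mean 3.5

# Dummy regressor: always predict the mean of the training labels.
dummy_pred = [mean(train_scores)] * len(test_scores)
dummy_r2 = r_squared(test_scores, dummy_pred)
```

Here `dummy_r2` is negative because the training mean is a worse constant guess than the test set's own mean, which is the reference point at which r^2 equals zero.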
COPA
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-SR | 0.689 |
XLM-R-SloBERTić | Copa-SR | 0.665 |
XLM-R-BERTić | Copa-SR | 0.637 |
crosloengual-bert | Copa-SR | 0.607 |
XLM-Roberta-Base | Copa-SR | 0.573 |
XLM-Roberta-Large | Copa-SR | 0.570 |
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-HR | 0.669 |
XLM-R-SloBERTić | Copa-HR | 0.628 |
XLM-R-BERTić | Copa-HR | 0.635 |
crosloengual-bert | Copa-HR | 0.669 |
XLM-Roberta-Base | Copa-HR | 0.585 |
XLM-Roberta-Large | Copa-HR | 0.571 |
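Accuracy on COPA is the share of items for which the alternative chosen by the model matches the gold label. The sketch below assumes that a fine-tuned model yields one plausibility score per alternative and that labels are encoded as 0 or 1. The `CopaItem` class, the scores, and the English toy items are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CopaItem:
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # assumed encoding: 0 -> choice1, 1 -> choice2


def copa_accuracy(items: list[CopaItem], scores: list[tuple[float, float]]) -> float:
    """Fraction of items where the higher-scored alternative is the gold one.

    `scores` holds one (score_choice1, score_choice2) pair per item.
    """
    correct = sum(
        int(s2 > s1) == item.label for item, (s1, s2) in zip(items, scores)
    )
    return correct / len(items)


# Toy items and model scores, for illustration only.
items = [
    CopaItem("The ground was wet.", "It had rained.", "The sun was out.", "cause", 0),
    CopaItem("She dropped the glass.", "It floated.", "It broke.", "effect", 1),
    CopaItem("He was tired.", "He went to bed.", "He ran a marathon.", "effect", 0),
]
scores = [(0.9, 0.1), (0.2, 0.8), (0.4, 0.6)]
accuracy = copa_accuracy(items, scores)
```

With two alternatives per item, random guessing gives an expected accuracy of 0.5, which is a useful reference point when reading the tables above.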
Citation
Please cite the following paper:
@inproceedings{ljubesic-etal-2024-language,
title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
author = "Ljube{\v{s}}i{\'c}, Nikola and
Suchomel, V{\'\i}t and
Rupnik, Peter and
Kuzman, Taja and
van Noord, Rik",
editor = "Melero, Maite and
Sakti, Sakriani and
Soria, Claudia",
booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.sigul-1.23",
pages = "189--203",
}