---
license: mit
base_model: xlm-roberta-base
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
metrics:
- f1
---
**⚠️ Warning: An updated version of this model is available [here](https://huggingface.co/segment-any-text/sat-12l-sm). This model is no longer maintained.**
**Please refer to our Segment any Text paper for more details: [https://arxiv.org/abs/2406.16678](https://arxiv.org/abs/2406.16678)**
# xlmr-multilingual-sentence-segmentation
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on a corrupted version of the Universal Dependencies datasets.
It achieves the following results on the (also corrupted) evaluation set:
- Loss: 0.0074
- Precision: 0.9664
- Recall: 0.9677
- F1: 0.9670
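As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. The following is a minimal sketch; the helper name `f1_score` is ours and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for the evaluation set.
f1 = f1_score(0.9664, 0.9677)
print(f"{f1:.4f}")  # 0.9670
```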
# Test set performance
All results are F1 scores in percent; the best score per language is in bold. The baselines are Spacy (multilingual) and WtPSplit [1].
## Opus100 [2]
Who wins most? XLM-RoBERTa: 56, WtPSplit: 12, Spacy (multilingual): 8
| | af | am | ar | az | be | bg | bn | ca | cs | cy | da | de | el | en | eo | es | et | eu | fa | fi | fr | fy | ga | gd | gl | gu | ha | he | hi | hu | hy | id | is | it | ja | ka | kk | km | kn | ko | ku | ky | lt | lv | mg | mk | ml | mn | mr | ms | my | ne | nl | pa | pl | ps | pt | ro | ru | si | sk | sl | sq | sr | sv | ta | te | th | tr | uk | ur | uz | vi | xh | yi | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | 42.61 | 6.69 | 58.52 | 73.59 | 34.78 | 93.74 | 38.04 | 88.76 | 87.70 | 26.30 | 90.52 | 74.15 | 89.75 | 89.25 | 88.77 | 90.95 | 87.26 | 81.20 | 55.40 | 93.28 | 85.77 | 21.49 | 60.61 | 36.83 | 88.77 | 5.59 | **89.39** | **92.21** | 53.33 | 93.26 | 24.14 | 90.13 | **95.38** | 86.32 | 0.20 | 38.24 | 42.39 | 0.10 | 9.66 | 51.79 | 27.64 | 21.77 | 76.91 | 77.02 | 83.60 | **93.74** | 39.09 | 33.23 | 86.56 | 87.39 | 0.10 | 6.59 | **93.65** | 5.26 | 92.42 | 2.41 | 92.07 | 91.63 | 75.95 | 75.91 | 92.13 | 93.00 | **92.96** | **95.01** | 93.52 | 36.97 | 64.59 | 21.64 | **94.05** | 89.68 | 29.17 | 64.99 | 90.59 | 64.89 | 4.14 | 0.09 |
| WtPSplit | 76.90 | **59.08** | 68.08 | 76.42 | 71.29 | 93.97 | 79.76 | 89.79 | 89.36 | 73.21 | 90.02 | 80.74 | 92.80 | 91.91 | 92.24 | 92.11 | 84.47 | 87.24 | 59.97 | 91.96 | 88.53 | 65.84 | 79.49 | 83.33 | 90.31 | **70.51** | 82.43 | 90.58 | 66.70 | 93.00 | 87.14 | 89.80 | 94.77 | 87.43 | **41.79** | **91.26** | 73.25 | **69.54** | 68.98 | 56.21 | **79.12** | 83.94 | 81.33 | 82.70 | **89.33** | 92.87 | 80.81 | 73.26 | 89.20 | 88.51 | **65.54** | **71.33** | 92.63 | 64.11 | 92.72 | **62.84** | 91.05 | 90.91 | 84.23 | 80.32 | 92.30 | 92.19 | 90.32 | 94.76 | 92.08 | 63.48 | 76.49 | 68.88 | 93.30 | 89.60 | 52.59 | **77.79** | 91.29 | 80.28 | **75.70** | 71.64 |
| XLM-RoBERTa (ours) | **83.97** | 41.59 | **81.56** | **81.30** | **85.68** | **94.34** | **84.10** | **91.80** | **91.23** | **78.72** | **92.64** | **86.73** | **93.87** | **94.50** | **94.57** | **93.18** | **90.19** | **90.28** | **74.79** | **94.06** | **90.46** | **81.76** | **84.33** | **85.62** | **92.55** | 67.26 | 86.61 | 91.22 | **72.69** | **94.53** | **89.83** | **92.24** | 93.78 | **89.27** | 41.43 | 78.39 | **89.15** | 36.60 | **70.51** | **82.77** | 58.14 | **89.41** | **89.99** | **88.25** | 86.82 | 92.81 | **86.14** | **94.73** | **93.25** | **92.44** | 49.39 | 66.02 | 93.60 | **69.22** | **93.51** | 61.86 | **92.84** | **93.19** | **89.47** | **86.24** | **92.95** | **93.46** | 91.79 | 94.16 | **93.93** | **72.74** | **81.77** | **74.49** | 93.17 | **92.15** | **62.92** | 75.65 | **93.41** | **84.89** | 56.85 | **77.07** |
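The "Who wins most?" tallies count, for each language, which system has the highest F1. A minimal sketch of that count is shown here on four columns copied from the table above (`af`, `am`, `ar`, `he`). The function name `count_wins` is hypothetical, and crediting every tied system with a win is an assumption about how ties were handled.

```python
from collections import Counter

# F1 scores copied from the Opus100 table above for four languages.
scores = {
    "Spacy (multilingual)": {"af": 42.61, "am": 6.69, "ar": 58.52, "he": 92.21},
    "WtPSplit": {"af": 76.90, "am": 59.08, "ar": 68.08, "he": 90.58},
    "XLM-RoBERTa (ours)": {"af": 83.97, "am": 41.59, "ar": 81.56, "he": 91.22},
}


def count_wins(scores: dict[str, dict[str, float]]) -> Counter:
    """Count, per system, the languages on which it has the best F1."""
    wins = Counter()
    languages = next(iter(scores.values())).keys()
    for lang in languages:
        best = max(system_scores[lang] for system_scores in scores.values())
        for system, system_scores in scores.items():
            # Assumption: every system tied for the best score gets a win.
            if system_scores[lang] == best:
                wins[system] += 1
    return wins


wins = count_wins(scores)
for system, n in wins.most_common():
    print(f"{system}: {n}")
```

On this four-language subset, XLM-RoBERTa wins two languages and the other two systems win one each; running the same function over all columns should reproduce the tallies given per benchmark.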
## Universal Dependencies [3]
Who wins most? XLM-RoBERTa: 24, WtPSplit: 17, Spacy (multilingual): 13
| | af | ar | be | bg | bn | ca | cs | cy | da | de | el | en | es | et | eu | fa | fi | fr | ga | gd | gl | he | hi | hu | hy | id | is | it | ja | jv | kk | ko | la | lt | lv | mr | nl | pl | pt | ro | ru | sk | sl | sq | sr | sv | ta | th | tr | uk | ur | vi | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | **98.47** | 80.38 | 80.27 | 93.62 | 51.85 | **98.95** | 89.68 | 98.89 | 94.96 | 88.02 | 94.16 | 92.20 | **98.70** | 93.77 | 95.79 | **99.83** | 92.88 | 96.33 | **96.67** | 63.04 | 92.37 | 94.37 | 0.32 | **98.45** | 11.39 | 98.01 | **95.41** | 92.49 | 0.37 | 98.03 | 96.21 | **99.80** | 0.09 | 93.86 | **98.52** | 92.13 | 92.86 | 97.02 | 94.91 | **98.05** | 84.31 | 90.26 | **98.23** | **100.00** | 97.84 | 94.91 | 66.67 | 1.95 | **97.63** | 94.16 | 0.37 | 96.40 | 0.40 |
| WtPSplit | 98.27 | **83.00** | 89.28 | **98.16** | **99.12** | 98.52 | 92.98 | **99.26** | 94.56 | 96.13 | **96.94** | 94.73 | 97.60 | 94.09 | 97.24 | 97.29 | 94.69 | **96.71** | 86.60 | 72.17 | **98.87** | 95.79 | 96.78 | 96.08 | **96.80** | **98.41** | 86.39 | 95.45 | **95.84** | **98.18** | 96.28 | 99.11 | 91.43 | **97.67** | 96.42 | 91.84 | 93.61 | 95.92 | **96.13** | 81.50 | 86.28 | 95.57 | 96.85 | 99.17 | **98.45** | **95.86** | **97.54** | 70.26 | 96.00 | 92.08 | 93.79 | 92.97 | **97.25** |
| XLM-RoBERTa (ours) | 96.81 | 78.99 | **91.60** | 97.89 | **99.12** | 95.99 | **96.05** | 97.17 | **96.62** | **96.29** | 94.33 | **94.76** | 95.73 | **96.20** | **97.37** | 97.49 | **96.34** | 95.70 | 89.78 | **84.20** | 95.72 | **95.95** | **97.51** | 96.24 | 95.62 | 97.22 | 92.93 | **96.88** | 94.23 | 96.29 | **98.40** | 97.46 | **96.35** | 95.82 | 96.91 | **95.92** | **96.27** | **97.24** | 95.83 | 94.63 | **91.59** | **95.88** | 96.43 | 98.36 | 96.83 | 94.95 | 95.93 | **89.26** | 96.52 | **94.59** | **96.20** | **97.31** | 95.12 |
## Ersatz [4]
Who wins most? XLM-RoBERTa: 10, WtPSplit: 8, Spacy (multilingual): 4
| | ar | cs | de | en | es | et | fi | fr | gu | hi | ja | kk | km | lt | lv | pl | ps | ro | ru | ta | tr | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | **91.26** | 96.46 | 93.89 | 94.40 | 97.31 | **97.15** | 94.99 | 96.43 | 4.44 | 18.41 | 0.18 | 97.11 | 0.08 | 93.53 | **98.73** | 93.69 | **94.44** | 94.87 | 93.45 | 68.65 | 95.39 | 0.10 |
| WtPSplit | 89.45 | 93.41 | 95.93 | **97.16** | **98.74** | 95.84 | 97.10 | **97.61** | 90.62 | 94.87 | **82.14** | 95.94 | **82.89** | **96.74** | 97.22 | 95.16 | 86.99 | **97.55** | **97.82** | 94.76 | 93.53 | 89.02 |
| XLM-RoBERTa (ours) | 79.78 | **96.94** | **97.02** | 96.10 | 97.06 | 96.80 | **97.67** | 96.33 | **93.73** | **95.34** | 77.54 | **97.28** | 78.94 | 96.13 | 96.45 | **96.71** | 92.33 | 96.24 | 97.15 | **95.94** | **95.76** | **90.11** |
## German-English code-switching [5]
| | de |
|:---------------------|:----------|
| Spacy (multilingual) | 79.55 |
| WtPSplit | 77.41 |
| XLM-RoBERTa (ours) | **85.78** |
[1] [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398) (Minixhofer et al., ACL 2023)
[2] [Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation](https://aclanthology.org/2020.acl-main.148) (Zhang et al., ACL 2020)
[3] [Universal Dependencies](https://aclanthology.org/2021.cl-2.11) (de Marneffe et al., CL 2021)
[4] [A unified approach to sentence segmentation of punctuated text in many languages](https://aclanthology.org/2021.acl-long.309) (Wicks & Post, ACL-IJCNLP 2021)
[5] [The Denglisch Corpus of German-English Code-Switching](https://aclanthology.org/2023.sigtyp-1.5) (Osmelak & Wintner, SIGTYP 2023)
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
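To illustrate what `lr_scheduler_type: linear` means for the learning rate, here is a minimal pure-Python sketch of a linear decay schedule. It rests on two assumptions that are not stated in the list above: there is no warmup (`warmup_steps=0`), and the run has about 2480 optimizer steps in total (roughly 496 steps per epoch over 5 epochs, estimated from the step and epoch columns of the training log). The function name `linear_lr` is ours.

```python
BASE_LR = 2e-05     # learning_rate from the list above
TOTAL_STEPS = 2480  # assumption: ~496 steps per epoch x 5 epochs


def linear_lr(step: int, base_lr: float = BASE_LR,
              total_steps: int = TOTAL_STEPS, warmup_steps: int = 0) -> float:
    """Learning rate at a given step under linear warmup then linear decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup (unused when warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1240, 2480):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 2e-05, is halved at the midpoint of training, and reaches zero at the final step.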
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|
| No log | 0.2 | 100 | 0.0125 | 0.9320 | 0.9487 | 0.9403 |
| No log | 0.4 | 200 | 0.0099 | 0.9547 | 0.9513 | 0.9530 |
| No log | 0.6 | 300 | 0.0092 | 0.9616 | 0.9506 | 0.9561 |
| No log | 0.81 | 400 | 0.0083 | 0.9584 | 0.9618 | 0.9601 |
| 0.0212 | 1.01 | 500 | 0.0082 | 0.9551 | 0.9642 | 0.9596 |
| 0.0212 | 1.21 | 600 | 0.0084 | 0.9630 | 0.9614 | 0.9622 |
| 0.0212 | 1.41 | 700 | 0.0079 | 0.9606 | 0.9648 | 0.9627 |
| 0.0212 | 1.61 | 800 | 0.0077 | 0.9609 | 0.9661 | 0.9635 |
| 0.0212 | 1.81 | 900 | 0.0076 | 0.9623 | 0.9649 | 0.9636 |
| 0.0067 | 2.02 | 1000 | 0.0077 | 0.9598 | 0.9689 | 0.9643 |
| 0.0067 | 2.22 | 1100 | 0.0075 | 0.9614 | 0.9680 | 0.9647 |
| 0.0067 | 2.42 | 1200 | 0.0073 | 0.9626 | 0.9682 | 0.9654 |
| 0.0067 | 2.62 | 1300 | 0.0075 | 0.9617 | 0.9692 | 0.9654 |
| 0.0067 | 2.82 | 1400 | 0.0073 | 0.9658 | 0.9648 | 0.9653 |
| 0.0054 | 3.02 | 1500 | 0.0076 | 0.9656 | 0.9663 | 0.9660 |
| 0.0054 | 3.23 | 1600 | 0.0073 | 0.9625 | 0.9703 | 0.9664 |
| 0.0054 | 3.43 | 1700 | 0.0073 | 0.9658 | 0.9659 | 0.9658 |
| 0.0054 | 3.63 | 1800 | 0.0073 | 0.9626 | 0.9707 | 0.9666 |
| 0.0054 | 3.83 | 1900 | 0.0073 | 0.9659 | 0.9677 | 0.9668 |
| 0.0046 | 4.03 | 2000 | 0.0075 | 0.9671 | 0.9659 | 0.9665 |
| 0.0046 | 4.23 | 2100 | 0.0075 | 0.9654 | 0.9687 | 0.9671 |
| 0.0046 | 4.44 | 2200 | 0.0075 | 0.9662 | 0.9676 | 0.9669 |
| 0.0046 | 4.64 | 2300 | 0.0074 | 0.9657 | 0.9684 | 0.9670 |
| 0.0046 | 4.84 | 2400 | 0.0074 | 0.9664 | 0.9678 | 0.9671 |
### Framework versions
- Transformers 4.39.1
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
# Citation
Please consider citing our paper if this model has helped you:
```
@inproceedings{frohmann-etal-2024-segment,
title = "Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation",
author={Markus Frohmann and Igor Sterner and Ivan Vulić and Benjamin Minixhofer and Markus Schedl},
month = nov,
year = "2024",
publisher = "Association for Computational Linguistics",
}
```