---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
- cjvt/solar3
- cjvt/sloleks
language:
- sl
tags:
- word spelling error annotator
---
# SloBERTa-Incorrect-Spelling-Annotator
This SloBERTa model annotates incorrectly spelled words in text. It uses the following labels:
- 0: The word is spelled correctly (no error).
- 1: The word is spelled incorrectly.
- 2: The word and its neighbour should be written together as one word.
- 3: The word should be split and written as separate words.
## Model Output Example
Imagine we have the following Slovenian text:
_Model vbesedilu o znači besede, v katerih se najajajo napake._
We convert the input into the format expected by the model by inserting a `<mask>` token after every word and punctuation mark:
_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_
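This conversion can be sketched in a few lines of Python (a minimal illustration; the function name is ours, and it assumes the input is already whitespace-tokenized, with punctuation split off as in the example above):

```python
def add_masks(text: str, mask_token: str = "<mask>") -> str:
    """Insert a mask token after every whitespace-separated token,
    giving the model one prediction slot per word."""
    return " ".join(f"{tok} {mask_token}" for tok in text.split())

masked = add_masks("Model vbesedilu o znači besede , v katerih se najajajo napake .")
# masked == "Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>"
```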
The model might then return the following predictions (chosen here for illustration; actual outputs may differ):
_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_
We can observe the following:
1. The word `najajajo` is spelled incorrectly, so the model marks it with label 1.
2. The word `vbesedilu` should be written as two separate words, `v` and `besedilu`, so the model marks it with label 3.
3. The words `o` and `znači` should be written together as one word, `označi`, so the model marks them with label 2.
All other words are spelled correctly and receive label 0.
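The flat word/label output shown above can be post-processed with a small helper (illustrative only; the function name is ours) that pairs each word with its predicted label and collects the flagged words:

```python
def extract_flags(prediction: str) -> list[tuple[str, int]]:
    """Pair each word with its label from a flat 'word label word label ...'
    string and keep only the words flagged with a non-zero label."""
    toks = prediction.split()
    words, labels = toks[0::2], toks[1::2]
    return [(w, int(l)) for w, l in zip(words, labels) if l != "0"]

flags = extract_flags(
    "Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0"
)
# flags == [("vbesedilu", 3), ("o", 2), ("znači", 2), ("najajajo", 1)]
```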
## More details
The model, along with its training and evaluation, is described in more detail in the following paper:
```
@inproceedings{neural-spell-checker,
  author = {Klemen, Matej and Bo\v{z}i\v{c}, Martin and Holdt, \v{S}pela Arhar and Robnik-\v{S}ikonja, Marko},
  title = {Neural Spell-Checker: Beyond Words with Synthetic Data Generation},
  year = {2024},
  doi = {10.1007/978-3-031-70563-2_7},
  booktitle = {Text, Speech, and Dialogue: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024, Proceedings, Part I},
  pages = {85–96},
  numpages = {12}
}
```
## Acknowledgement
The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.
## Authors
Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.