language:

license: cc-by-sa-4.0

SloBERTa-Incorrect-Spelling-Annotator

This SloBERTa model annotates incorrectly spelled words in Slovenian text. It assigns one of the following labels to each token:

0: No error detected (as the example below shows for correctly written words).
1: The word is spelled incorrectly.
2: The word should be written together with a neighbouring word.
3: The word should be written as separate words.

Model Output Example

Imagine we have the following Slovenian text:

Model vbesedilu o znači besede, v katerih se najajajo napake.

We first convert the input into the format the SloBERTa model accepts, placing a <mask> token after every word and punctuation mark:

Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>
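
The conversion above can be sketched in a few lines of Python. This is only an illustration: the function name to_masked_input and the regex-based tokenization (runs of word characters, with each punctuation mark as its own token) are assumptions inferred from the example, not the authors' preprocessing code.

```python
import re

MASK = "<mask>"

def to_masked_input(text: str) -> str:
    """Interleave a <mask> token after every word and punctuation mark."""
    # Assumed tokenization: a token is either a run of word characters
    # or a single non-space, non-word character (punctuation).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)

sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
print(to_masked_input(sentence))
```

For the sentence above, this reproduces the masked line shown in the example. The tokenizer actually used to prepare the training data may treat edge cases (abbreviations, numbers, hyphenated words) differently.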

The model might return the following predictions (note: predictions chosen for demonstration/explanation, not reproducibility!):

Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0

We can observe the following:

The word najajajo is spelled incorrectly, so the model marks it with the label 1.
The word vbesedilu should be written as two words, v and besedilu, so the model marks it with the label 3.
The words o and znači should be written as one word, označi, so the model marks both of them with the label 2.
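
To make the reading of such an output concrete, here is a minimal sketch that parses a line of alternating tokens and labels and lists the flagged words. The names LABELS, parse_predictions and flagged_words are hypothetical helpers, and treating label 0 as "no error" is an assumption based on the example rather than on the label list.

```python
# Human-readable descriptions of the labels; 0 = "no error" is an assumption.
LABELS = {
    0: "no error",
    1: "misspelled",
    2: "write together",
    3: "write separately",
}

def parse_predictions(output: str) -> list[tuple[str, int]]:
    """Split 'token label token label ...' into (token, label) pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating tokens and labels")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]

def flagged_words(pairs: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Keep only the tokens whose label signals a problem."""
    return [(token, LABELS[label]) for token, label in pairs if label != 0]

output = "Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0"
for token, description in flagged_words(parse_predictions(output)):
    print(f"{token}: {description}")
```

Note that the labels only locate the problem; they do not supply the corrected form (for example, they say that vbesedilu must be split but not where).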

More details

The model, together with its training and evaluation, is described in more detail in the following paper:

@inproceedings{neural-spell-checker,
author = {Klemen, Matej and Bo\v{z}i\v{c}, Martin and Holdt, \v{S}pela Arhar and Robnik-\v{S}ikonja, Marko},
title = {Neural Spell-Checker: Beyond Words with Synthetic Data Generation},
year = {2024},
doi = {10.1007/978-3-031-70563-2_7},
booktitle = {Text, Speech, and Dialogue: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024, Proceedings, Part I},
pages = {85–96},
numpages = {12}
}

Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.
