|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- cjvt/cc_gigafida |
|
- cjvt/solar3 |
|
- cjvt/sloleks |
|
language: |
|
- sl |
|
tags: |
|
- word spelling error annotator |
|
--- |
|
|
|
|
# SloBERTa-Incorrect-Spelling-Annotator |
|
|
|
This SloBERTa model is designed to annotate incorrectly spelled words in text. It utilizes the following labels: |
|
|
|
- 1: Indicates incorrectly spelled words. |
|
- 2: Indicates words that should be written together as one word. |
|
- 3: Indicates a word that should be written as separate words. |
|
|
|
## Model Output Example |
|
|
|
Imagine we have the following Slovenian text: |
|
|
|
_Model vbesedilu o znači besede, v katerih se najajajo napake._ |
|
|
|
Converting the input to the format the SloBERTa model accepts, with a `<mask>` token after every word and punctuation mark, gives: |
|
|
|
_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_ |
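
The conversion shown above can be sketched in a few lines of Python. This is only an illustration: the regex-based tokenization (words and single punctuation marks) and the helper name `to_masked_input` are assumptions, not the preprocessing code used by the authors.

```python
import re

MASK = "<mask>"  # mask token as it appears in the example above


def to_masked_input(text: str) -> str:
    """Insert a mask token after every word and punctuation mark.

    Assumed tokenization: runs of word characters, or single
    non-space, non-word characters (punctuation).
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)


sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
masked = to_masked_input(sentence)
print(masked)
```

Each `<mask>` position is where the model is expected to predict the label for the preceding token.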
|
|
|
The model might return the following predictions (note: these predictions were chosen for illustration and are not meant to be reproducible): |
|
|
|
_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_ |
|
|
|
We can observe the following: |
|
1. In the input sentence, the word `najajajo` is spelled incorrectly, so the model marks it with the token (1). |
|
2. The word `vbesedilu` should be written as two words `v` and `besedilu`, so the model marks it with the token (3). |
|
3. The words `o` and `znači` should be written as one word `označi`, so the model marks them with the tokens (2). |
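
To read such an output programmatically, the annotated string can be split into (word, label) pairs. The sketch below assumes the output is a whitespace-separated sequence that alternates tokens and numeric labels, as in the example, and that label `0` means "no error" (inferred from the example, not stated in the label list). The names `parse_annotations` and `flagged` are hypothetical helpers.

```python
def parse_annotations(annotated: str) -> list[tuple[str, int]]:
    """Split 'word label word label ...' into (word, label) pairs."""
    parts = annotated.split()
    words, labels = parts[0::2], parts[1::2]
    return [(word, int(label)) for word, label in zip(words, labels)]


def flagged(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only tokens with a non-zero label (assumed: 0 = no error)."""
    return [(word, label) for word, label in pairs if label != 0]


output = ("Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 "
          "katerih 0 se 0 najajajo 1 napake 0 . 0")
pairs = parse_annotations(output)
print(flagged(pairs))
```

Adjacent tokens labelled `2` (here `o` and `znači`) belong together and would be joined into a single word.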
|
|
|
## More details |
|
|
|
Testing the model with **generated** test sets provides the following result: |
|
|
|
- `1` token prediction -> Precision: 0.911; Recall: 0.975; F1: 0.942 |
|
|
|
Testing the model with test sets constructed using the **Šolar Eval** dataset provides the following results: |
|
|
|
|
|
- `1` token prediction -> Precision: 0.900; Recall: 0.860; F1: 0.880 |
|
- `2` token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839 |
|
- `3` token prediction -> Precision: 0.518; Recall: 0.671; F1: 0.585 |
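
As a quick sanity check, F1 is the harmonic mean of precision and recall, and the values reported here are consistent with that definition. A minimal sketch (the helper name `f1_score` and the dictionary layout are only illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in this section
reported = {
    "generated, label 1": (0.911, 0.975, 0.942),
    "Šolar Eval, label 1": (0.900, 0.860, 0.880),
    "Šolar Eval, label 2": (0.826, 0.853, 0.839),
    "Šolar Eval, label 3": (0.518, 0.671, 0.585),
}

for name, (precision, recall, f1) in reported.items():
    computed = round(f1_score(precision, recall), 3)
    print(f"{name}: computed F1 = {computed:.3f}, reported F1 = {f1:.3f}")
```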
|
|
|
## Acknowledgement |
|
|
|
The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills. |
|
|
|
## Authors |
|
|
|
Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models. |