---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
- cjvt/solar3
- cjvt/sloleks
language:
- sl
tags:
- word spelling error annotator
---


# SloBERTa-Incorrect-Spelling-Annotator

This SloBERTa model annotates incorrectly spelled words in text. It uses the following labels:

- 1: an incorrectly spelled word.
- 2: two words that should be written together as one.
- 3: a word that should be written separately as two words.

Words without errors receive the label 0.

## Model Output Example

Imagine we have the following Slovenian text:

_Model vbesedilu o znači besede, v katerih se najajajo napake._ (roughly: "The model marks the words in the text that contain errors.")

If we convert the input to the format expected by the model, with a `<mask>` token after every word:

_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_
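The masking step above can be sketched with a small helper. The function name `to_masked_input` and the hard-coded `<mask>` string are illustrative assumptions; a real pipeline would take the mask token from the model's tokenizer.

```python
def to_masked_input(sentence: str, mask_token: str = "<mask>") -> str:
    """Insert a mask token after every whitespace-separated token.

    Sketch only: real preprocessing should use the model tokenizer's
    own mask token and tokenization rules, not plain whitespace splitting.
    """
    return " ".join(f"{token} {mask_token}" for token in sentence.split())

print(to_masked_input("Model vbesedilu o znači besede"))
# Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask>
```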

The model might return the following predictions (note: predictions chosen for demonstration/explanation, not reproducibility!):

_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_

We can observe the following:
1. In the input sentence, the word `najajajo` is spelled incorrectly, so the model marks it with the token (1).
2. The word `vbesedilu` should be written as two words `v` and `besedilu`, so the model marks it with the token (3).
3. The words `o` and `znači` should be written as one word `označi`, so the model marks them with the tokens (2).
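The prediction string above can be decoded back into word/label pairs. The helper below and its label descriptions are an illustrative sketch, not part of the model's API:

```python
# Human-readable meanings of the model's labels (0 = no error).
LABELS = {
    "0": "correct",
    "1": "misspelled",
    "2": "join with neighbor",
    "3": "split into separate words",
}

def decode_predictions(output: str) -> list[tuple[str, str]]:
    """Pair each word in the output string with its label's meaning."""
    tokens = output.split()
    words, labels = tokens[0::2], tokens[1::2]
    return [(word, LABELS[label]) for word, label in zip(words, labels)]

for word, verdict in decode_predictions("Model 0 vbesedilu 3 o 2 znači 2 najajajo 1"):
    print(word, "->", verdict)
```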

## More details

Testing the model on a **generated** test set gives the following result:

- `1` token prediction -> Precision: 0.911; Recall: 0.975; F1: 0.942

Testing the model on test sets constructed from the **Šolar** dataset gives the following results:

- `1` token prediction -> Precision: 0.900; Recall: 0.860; F1: 0.880
- `2` token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
- `3` token prediction -> Precision: 0.518; Recall: 0.671; F1: 0.585
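The reported F1 scores are the harmonic mean of precision and recall, which can be checked directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# `1` token prediction on the Šolar-based test set:
print(f"{f1(0.900, 0.860):.3f}")  # 0.880
```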

## Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

## Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.