Martin97Bozic
committed on
Commit
·
39d8722
1
Parent(s):
baa6e7e
Create README.md
README.md
ADDED
@@ -0,0 +1,62 @@
1 |
+
---
|
2 |
+
license: cc-by-sa-4.0
|
3 |
+
datasets:
|
4 |
+
- cjvt/cc_gigafida
|
5 |
+
- cjvt/solar3
|
6 |
+
- cjvt/sloleks
|
7 |
+
language:
|
8 |
+
- sl
|
9 |
+
tags:
|
10 |
+
- word spelling error annotator
|
11 |
+
---
|
12 |
+
|
19 |
+
|
20 |
+
# SloBERTa-Incorrect-Spelling-Annotator
|
21 |
+
|
22 |
+
This SloBERTa model annotates incorrectly spelled words in text. It assigns each word one of the following labels (words with no detected error receive 0, as in the example below):
|
23 |
+
|
24 |
+
- 1: Indicates incorrectly spelled words.
|
25 |
+
- 2: Denotes cases where two words should be written together.
|
26 |
+
- 3: Indicates a word that should be written as separate words.
|
27 |
+
|
28 |
+
## Model Output Example
|
29 |
+
|
30 |
+
Imagine we have the following Slovenian text:
|
31 |
+
|
32 |
+
_Model vbesedilu o znači besede, v katerih se najajajo napake._
|
33 |
+
|
34 |
+
We convert the text into the input format for the SloBERTa model by placing a `<mask>` after every word and punctuation mark:
|
35 |
+
|
36 |
+
_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_
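
The conversion shown above can be sketched in a few lines of Python. This is only an illustration: the function name `to_masked_input` and the regex-based tokenization are assumptions, and the model's actual preprocessing may split text differently.

```python
import re

MASK = "<mask>"


def to_masked_input(text: str) -> str:
    """Append a mask slot after every word and punctuation mark."""
    # Assumption: a simple regex tokenizer (runs of word characters, or single
    # punctuation characters). The real preprocessing pipeline may differ.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)


sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
print(to_masked_input(sentence))
```

Each `<mask>` is the slot that the model fills with a label for the token before it.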
|
37 |
+
|
38 |
+
The model might return the following predictions (note: these predictions were chosen for demonstration, not for reproducibility):
|
39 |
+
|
40 |
+
_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_
|
41 |
+
|
42 |
+
We can observe the following:
|
43 |
+
1. In the input sentence, the word `najajajo` is spelled incorrectly, so the model marks it with the token (1).
|
44 |
+
2. The word `vbesedilu` should be written as two words `v` and `besedilu`, so the model marks it with the token (3).
|
45 |
+
3. The words `o` and `znači` should be written as one word `označi`, so the model marks both of them with the token (2).
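
To make the observations above concrete, here is a minimal sketch that reads the demonstration output back into (token, label) pairs and lists the flagged words. The helper names `parse_predictions` and `flagged` are hypothetical, and the description of label 0 is inferred from the example rather than stated in the label list.

```python
# Label meanings; the entry for 0 is an assumption inferred from the example.
LABELS = {
    0: "no error detected",
    1: "incorrectly spelled",
    2: "should be written together with a neighbouring word",
    3: "should be written as separate words",
}


def parse_predictions(output: str) -> list[tuple[str, int]]:
    """Turn 'token label token label ...' into a list of (token, label) pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating token/label pairs")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]


def flagged(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only the tokens whose label is not 0."""
    return [(token, label) for token, label in pairs if label != 0]


output = ("Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 "
          "se 0 najajajo 1 napake 0 . 0")
for token, label in flagged(parse_predictions(output)):
    print(f"{token}: {LABELS[label]}")
```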
|
46 |
+
|
47 |
+
## More details
|
48 |
+
|
49 |
+
Testing the model on **generated** test sets gives the following result:
|
50 |
+
|
51 |
+
- `1` token prediction -> Precision: 0.979; Recall: 0.986; F1: 0.983
|
52 |
+
|
53 |
+
Testing the model with test sets constructed using the **Šolar** dataset provides the following results (combining detection and correction of words with incorrect spelling):
|
54 |
+
|
55 |
+
|
56 |
+
- `1` token prediction -> Precision: 0.753; Recall: 0.873; F1: 0.809
|
57 |
+
- `2` token prediction -> Precision: 0.516; Recall: 0.671; F1: 0.585
|
58 |
+
- `3` token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
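
As a reading aid for the scores above: F1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. The sketch below recomputes it from the rounded precision and recall values, so the result can differ from a reported F1 in the third decimal place. The function name `f1_score` and the dictionary layout are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results listed above.
reported = {
    "generated, label 1": (0.979, 0.986, 0.983),
    "Šolar, label 1": (0.753, 0.873, 0.809),
    "Šolar, label 2": (0.516, 0.671, 0.585),
    "Šolar, label 3": (0.826, 0.853, 0.839),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported F1 = {f1:.3f}")
```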
|
59 |
+
|
60 |
+
## Acknowledgement
|
61 |
+
|
62 |
+
The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.
|