Commit 39d8722 by Martin97Bozic · 1 parent: baa6e7e

Create README.md

  1. README.md +62 -0
README.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ license: cc-by-sa-4.0
+ datasets:
+ - cjvt/cc_gigafida
+ - cjvt/solar3
+ - cjvt/sloleks
+ language:
+ - sl
+ tags:
+ - word spelling error annotator
+ ---
+
+ ---
+ language:
+ - sl
+
+ license: cc-by-sa-4.0
+ ---
+
+ # SloBERTa-Incorrect-Spelling-Annotator
+
+ This SloBERTa model annotates incorrectly spelled words in text. Tokens with no detected error receive label 0; detected errors receive one of the following labels:
+
+ - 1: The word is spelled incorrectly.
+ - 2: The word should be written together with an adjacent word.
+ - 3: The word should be split and written as separate words.
+
+ ## Model Output Example
+
+ Imagine we have the following Slovenian text:
+
+ _Model vbesedilu o znači besede, v katerih se najajajo napake._
+
+ If we convert the text to the SloBERTa model's input format, with a `<mask>` token after every word and punctuation mark, we get:
+
+ _Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_
+
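The conversion from plain text to the masked format can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the regex tokenizer and the helper names `tokenize` and `to_masked_input` are assumptions, chosen only because they reproduce the masked sentence shown in this README.

```python
import re

MASK = "<mask>"


def tokenize(text):
    """Split text into words and single punctuation marks.

    Assumption: a simple regex tokenizer; the tokenizer actually used
    to prepare the model's input is not documented in this README.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def to_masked_input(text):
    """Place one <mask> slot after every token, separated by spaces."""
    return " ".join(f"{token} {MASK}" for token in tokenize(text))


sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
print(to_masked_input(sentence))
```

Each `<mask>` slot is the position where the model is expected to predict the label of the token that precedes it.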
+ The model might return the following predictions (note: these predictions were chosen for demonstration and explanation, not for reproducibility):
+
+ _Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_
+
+ We can observe the following:
+ 1. In the input sentence, the word `najajajo` is spelled incorrectly, so the model marks it with the token (1).
+ 2. The word `vbesedilu` should be written as two words, `v` and `besedilu`, so the model marks it with the token (3).
+ 3. The words `o` and `znači` should be written as one word, `označi`, so the model marks both of them with the token (2).
+
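The observations above can also be read off the prediction string programmatically. The sketch below is illustrative only: `parse_predictions`, `flagged`, and `merge_joined` are hypothetical helpers, not part of the model's API, and the label descriptions paraphrase the label list in this README. Merging every run of consecutive label-2 tokens into one word is an assumption that fits this example but would wrongly fuse two adjacent, independent pairs.

```python
from itertools import groupby

# Short descriptions of the labels, paraphrased from this README.
LABELS = {
    0: "no detected error",
    1: "incorrectly spelled",
    2: "should be written together",
    3: "should be written separately",
}


def parse_predictions(annotated):
    """Turn 'token label token label ...' into (token, label) pairs."""
    parts = annotated.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating tokens and labels")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]


def flagged(pairs):
    """Keep only the tokens that carry an error label (1, 2 or 3)."""
    return [(token, LABELS[label]) for token, label in pairs if label != 0]


def merge_joined(pairs):
    """Join each run of consecutive label-2 tokens into a single word."""
    merged = []
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        tokens = [token for token, _ in group]
        if label == 2:
            merged.append("".join(tokens))
        else:
            merged.extend(tokens)
    return merged


output = ("Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 "
          "katerih 0 se 0 najajajo 1 napake 0 . 0")
pairs = parse_predictions(output)
for token, description in flagged(pairs):
    print(f"{token}: {description}")
```

Labels 1 and 3 only flag a token. The example does not show how a misspelled word is corrected or where a fused word should be split, so the sketch leaves those tokens unchanged.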
+ ## More details
+
+ Testing the model on **generated** test sets gives the following result:
+
+ - `1` token prediction -> Precision: 0.979; Recall: 0.986; F1: 0.983
+
+ Testing the model on test sets constructed from the **Šolar** dataset gives the following results (combining detection and correction of words with incorrect spelling):
+
+
+ - `1` token prediction -> Precision: 0.753; Recall: 0.873; F1: 0.809
+ - `2` token prediction -> Precision: 0.516; Recall: 0.671; F1: 0.585
+ - `3` token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
+
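As a quick sanity check on the scores above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall; the dictionary keys and the name `f1_score` are illustrative only. The recomputed values agree with the reported F1 to within about 0.002. The README does not say how the reported values were aggregated, so the small differences may come from rounding or from averaging over several test sets.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results listed above.
REPORTED = {
    "generated, label 1": (0.979, 0.986, 0.983),
    "Šolar, label 1": (0.753, 0.873, 0.809),
    "Šolar, label 2": (0.516, 0.671, 0.585),
    "Šolar, label 3": (0.826, 0.853, 0.839),
}

for name, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported_f1:.3f}")
```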
+ ## Acknowledgement
+
+ The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.