cjvt/SloBERTa-slo-word-spelling-annotator

	---
	license: cc-by-sa-4.0
	datasets:
	- cjvt/cc_gigafida
	- cjvt/solar3
	- cjvt/sloleks
	language:
	- sl
	tags:
	- word spelling error annotator
	---

	# SloBERTa-Incorrect-Spelling-Annotator

	This SloBERTa model annotates incorrectly spelled words in Slovenian text. It assigns one of the following labels to each word:

	- 0: The word contains no spelling error (as in the example output below).
	- 1: The word is spelled incorrectly.
	- 2: The word should be written together with a neighbouring word.
	- 3: The word should be written as separate words.

	## Model Output Example

	Imagine we have the following Slovenian text:

	_Model vbesedilu o znači besede, v katerih se najajajo napake._

	We first convert the input to the format the SloBERTa model accepts, placing a `<mask>` token after every word and punctuation mark:

	_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_
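
The conversion shown above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' preprocessing code: the helper name `to_masked_input` and the regex-based tokenization (runs of word characters, plus single punctuation marks) are assumptions made for illustration.

```python
import re


def to_masked_input(text: str, mask_token: str = "<mask>") -> str:
    """Place a mask token after every word and punctuation mark.

    Hypothetical helper: the tokenization rule here is an assumption,
    chosen so that the example sentence from this card is reproduced.
    """
    # Words are runs of word characters; every other non-space
    # character (e.g. "," or ".") becomes a token of its own.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Each token is followed by one mask token, all separated by spaces.
    return " ".join(f"{token} {mask_token}" for token in tokens)


sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
print(to_masked_input(sentence))
```

For the example sentence this yields the masked string shown above, one `<mask>` per word or punctuation mark.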

	The model might return the following predictions (note: these predictions were chosen for illustration and are not guaranteed to be reproducible):

	_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_

	We can observe the following:
	1. The word `najajajo` is spelled incorrectly, so the model marks it with the label (1).
	2. The word `vbesedilu` should be written as two words, `v` and `besedilu`, so the model marks it with the label (3).
	3. The words `o` and `znači` should be written as one word, `označi`, so the model marks both of them with the label (2).
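
The observations above can be reproduced by reading the output as alternating word and label pairs. The following is a sketch under stated assumptions: the function names `parse_predictions` and `flagged_words` are hypothetical, the label descriptions paraphrase this card, and treating label 0 as "no error" is inferred from the example rather than stated in the label list.

```python
# Label meanings paraphrased from this model card.
# Assumption: 0 means "no error" (inferred from the example output).
LABELS = {
    0: "no error",
    1: "misspelled",
    2: "should be joined with a neighbouring word",
    3: "should be written as separate words",
}


def parse_predictions(output: str) -> list[tuple[str, int]]:
    """Split 'word label word label ...' into (word, label) pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating word/label tokens")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]


def flagged_words(pairs: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Keep only words whose label is not 0, with a readable description."""
    return [(word, LABELS[label]) for word, label in pairs if label != 0]


example = ("Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 "
           "katerih 0 se 0 najajajo 1 napake 0 . 0")
for word, description in flagged_words(parse_predictions(example)):
    print(f"{word}: {description}")
```

Adjacent words that both carry label 2 (here `o` and `znači`) would be merged into one word in a corrected text.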

	## More details

	Testing the model on generated test sets gives the following result:

	- `1` token prediction -> Precision: 0.911; Recall: 0.975; F1: 0.942

	Testing the model on test sets constructed from the Šolar Eval dataset gives the following results:

	- `1` token prediction -> Precision: 0.900; Recall: 0.860; F1: 0.880
	- `2` token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
	- `3` token prediction -> Precision: 0.518; Recall: 0.671; F1: 0.585
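
As a quick consistency check, the reported F1 scores can be recomputed from the reported precision and recall, assuming F1 here is the usual harmonic mean of the two (the card does not state the formula explicitly). The helper name `f1_score` is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the results in this card.
reported = {
    "generated, label 1": (0.911, 0.975, 0.942),
    "Šolar Eval, label 1": (0.900, 0.860, 0.880),
    "Šolar Eval, label 2": (0.826, 0.853, 0.839),
    "Šolar Eval, label 3": (0.518, 0.671, 0.585),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.3f}")
```

Under that assumption, all four recomputed values agree with the reported ones to three decimal places.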

	## Acknowledgement

	The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

	## Authors

	Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.