---
language: tr
license: mit
---

# 🇹🇷 RoBERTaTurk-Small-Clean

## Model description

This is a small Turkish RoBERTa model, pretrained for Turkish language understanding on a cleaned corpus drawn from Turkish Wikipedia, the Turkish portion of the OSCAR corpus, and news websites.
The raw corpus was 38 GB; sentences containing errors were filtered out, leaving 20 GB of clean training data. As a result, the model works especially well on Turkish text that is written without errors.
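
The card does not detail how erroneous sentences were identified (the spelling-error detection work cited below is the related publication). Purely as an illustration, the sketch here filters a corpus with a simple word-list check; `wordlist.txt`, `is_clean`, and the tokenization are hypothetical stand-ins, not the authors' pipeline:

```python
# Illustrative corpus filtering, NOT the authors' actual method:
# drop any sentence containing a token that is not in a known-word list.
import re

def load_wordlist(path: str) -> set:
    # Hypothetical resource: one lowercase Turkish word per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f}

def is_clean(sentence: str, vocabulary: set) -> bool:
    # Note: str.lower() does not handle Turkish dotted/dotless 'i' specially.
    tokens = re.findall(r"\w+", sentence.lower())
    return all(token in vocabulary for token in tokens)

def filter_corpus(sentences, vocabulary):
    # Keep only sentences with no out-of-vocabulary (likely misspelled) tokens.
    return [s for s in sentences if is_clean(s, vocabulary)]

vocab = load_wordlist("wordlist.txt")  # hypothetical word list
clean_sentences = filter_corpus(["iki ülke arasında savaş başladı"], vocab)
```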

The model is smaller than the standard RoBERTa architecture: it has 8 layers instead of 12, which makes it faster and lighter to run while remaining effective at understanding Turkish, especially text written without errors.

Thanks to Turkcell, we were able to train the model for 1.5M steps on an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz with 256 GB RAM and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
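
For reference, an 8-layer RoBERTa of this kind can be expressed as a `RobertaConfig`. In the sketch below, only `num_hidden_layers=8` comes from this card; the remaining hyperparameters are standard RoBERTa-base defaults and are assumptions, the authoritative values being in the repository's `config.json`:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative only: num_hidden_layers=8 matches the card; the other
# values are RoBERTa-base defaults, not confirmed for this model.
config = RobertaConfig(
    num_hidden_layers=8,      # 8 layers instead of RoBERTa-base's 12
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = RobertaForMaskedLM(config)  # randomly initialized, for illustration
print(model.config.num_hidden_layers)  # 8
```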

# Usage

Load the model and tokenizer with the Transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
```
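
If you prefer not to use the pipeline API shown below, the same prediction can be computed by hand with the `tokenizer` and `model` loaded above. This is a minimal sketch assuming a PyTorch backend:

```python
import torch

sentence = "iki ülke arasında <mask> başladı"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the <mask> token and rank candidate fillers.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_index].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{tokenizer.decode([int(token_id)]).strip()}: {float(score):.4f}")
```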

# Fill Mask Usage

The example below masks a word in the sentence *iki ülke arasında `<mask>` başladı* ("a `<mask>` broke out between the two countries"). The top predictions are *savaş* ("war"), *çatışmalar* ("clashes"), *gerginlik* ("tension"), *çatışma* ("conflict"), and *görüşmeler* ("talks").

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-small-turkish-clean-uncased",
    tokenizer="burakaytan/roberta-small-turkish-clean-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.14830906689167023,
  'token': 1745,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.1442396193742752,
  'token': 18223,
  'token_str': ' çatışmalar'},
 {'sequence': 'iki ülke arasında gerginlik başladı',
  'score': 0.12025047093629837,
  'token': 13638,
  'token_str': ' gerginlik'},
 {'sequence': 'iki ülke arasında çatışma başladı',
  'score': 0.0615813322365284,
  'token': 5452,
  'token_str': ' çatışma'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.04512731358408928,
  'token': 4736,
  'token_str': ' görüşmeler'}]
```
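
The fill-mask pipeline returns the five most likely completions by default; pass `top_k` to change that:

```python
# Ask for ten candidates instead of the default five.
fill_mask("iki ülke arasında <mask> başladı", top_k=10)
```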

## Citation and Related Information

To cite this model:

```bibtex
@article{aytan2023deep,
  title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
  author={Aytan, Burak and {\c{S}}akar, Cemal Okan},
  journal={Turkish Journal of Electrical Engineering and Computer Sciences},
  volume={31},
  number={3},
  pages={581--595},
  year={2023}
}
```