---
language: tr
license: mit
---

# 🇹🇷 RoBERTaTurk-Small-Clean

## Model description

This is a small Turkish RoBERTa model, pretrained for Turkish language understanding on a cleaned corpus drawn from Turkish Wikipedia, the Turkish portion of the OSCAR corpus, and news websites.
The raw corpus was 38 GB; sentences containing errors were filtered out, leaving 20 GB of clean training data. As a result, the model works especially well on Turkish text that is written without errors.
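
The card does not detail how erroneous sentences were identified (the spelling-error detection work cited below is the related publication). Purely as an illustration, the sketch here filters a corpus with a simple word-list check; `wordlist.txt`, `is_clean`, and the tokenization are hypothetical stand-ins, not the authors' pipeline:

```python
# Illustrative corpus filtering, NOT the authors' actual method:
# drop any sentence containing a token that is not in a known-word list.
import re

def load_wordlist(path: str) -> set:
    # Hypothetical resource: one lowercase Turkish word per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f}

def is_clean(sentence: str, vocabulary: set) -> bool:
    # Note: str.lower() does not handle Turkish dotted/dotless 'i' specially.
    tokens = re.findall(r"\w+", sentence.lower())
    return all(token in vocabulary for token in tokens)

def filter_corpus(sentences, vocabulary):
    # Keep only sentences with no out-of-vocabulary (likely misspelled) tokens.
    return [s for s in sentences if is_clean(s, vocabulary)]

vocab = load_wordlist("wordlist.txt")  # hypothetical word list
clean_sentences = filter_corpus(["iki ülke arasında savaş başladı"], vocab)
```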

The model is smaller than the standard RoBERTa architecture: it has 8 layers instead of 12, which makes it faster and lighter to run while remaining effective at understanding Turkish, especially text written without errors.

Thanks to Turkcell, we were able to train the model for 1.5M steps on an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz with 256 GB RAM and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
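
For reference, an 8-layer RoBERTa of this kind can be expressed as a `RobertaConfig`. In the sketch below, only `num_hidden_layers=8` comes from this card; the remaining hyperparameters are standard RoBERTa-base defaults and are assumptions, the authoritative values being in the repository's `config.json`:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative only: num_hidden_layers=8 matches the card; the other
# values are RoBERTa-base defaults, not confirmed for this model.
config = RobertaConfig(
    num_hidden_layers=8,      # 8 layers instead of RoBERTa-base's 12
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = RobertaForMaskedLM(config)  # randomly initialized, for illustration
print(model.config.num_hidden_layers)  # 8
```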

# Usage

Load the model and tokenizer with the Transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
```
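
If you prefer not to use the pipeline API shown below, the same prediction can be computed by hand with the `tokenizer` and `model` loaded above. This is a minimal sketch assuming a PyTorch backend:

```python
import torch

sentence = "iki ülke arasında <mask> başladı"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the <mask> token and rank candidate fillers.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_index].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{tokenizer.decode([int(token_id)]).strip()}: {float(score):.4f}")
```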

# Fill Mask Usage

The example below masks a word in the sentence *iki ülke arasında `<mask>` başladı* ("a `<mask>` broke out between the two countries"). The top predictions are *savaş* ("war"), *çatışmalar* ("clashes"), *gerginlik* ("tension"), *çatışma* ("conflict"), and *görüşmeler* ("talks").

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-small-turkish-clean-uncased",
    tokenizer="burakaytan/roberta-small-turkish-clean-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.14830906689167023,
  'token': 1745,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.1442396193742752,
  'token': 18223,
  'token_str': ' çatışmalar'},
 {'sequence': 'iki ülke arasında gerginlik başladı',
  'score': 0.12025047093629837,
  'token': 13638,
  'token_str': ' gerginlik'},
 {'sequence': 'iki ülke arasında çatışma başladı',
  'score': 0.0615813322365284,
  'token': 5452,
  'token_str': ' çatışma'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.04512731358408928,
  'token': 4736,
  'token_str': ' görüşmeler'}]
```
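
The fill-mask pipeline returns the five most likely completions by default; pass `top_k` to change that:

```python
# Ask for ten candidates instead of the default five.
fill_mask("iki ülke arasında <mask> başladı", top_k=10)
```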

## Citation and Related Information

To cite this model:

```bibtex
@article{aytan2023deep,
  title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
  author={Aytan, Burak and {\c{S}}akar, Cemal Okan},
  journal={Turkish Journal of Electrical Engineering and Computer Sciences},
  volume={31},
  number={3},
  pages={581--595},
  year={2023}
}
```