tkesgin committed on
Commit
9a9c048
1 Parent(s): 31dcbe5

Update README.md

Files changed (1)
  1. README.md +75 -1
README.md CHANGED
@@ -1,3 +1,77 @@
  ---
- license: mit
+ widget:
+ - text: "gelirken bir litre [MASK] aldım."
+   example_title: "Örnek 1"
  ---
+
+ # turkish-small-bert-uncased
+
+ This is a small, uncased Turkish BERT model, developed to fill the gap in small-sized BERT models for Turkish. Because the model is uncased, it does not distinguish between turkish and Turkish.
+
+ #### ⚠ Uncased use requires manual lowercase conversion
+
+
+ **Don't** use the `do_lower_case = True` flag with the tokenizer. Instead, lowercase your text manually as follows:
+ ```python
+ text.replace("I", "ı").lower()
+ ```
+ This is due to a [known issue](https://github.com/huggingface/transformers/issues/6680) with the tokenizer.
+
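The one-line conversion above can be wrapped in a small helper. The following is a minimal sketch rather than part of the committed README: the function names `turkish_lower` and `lower_keep_mask` are hypothetical, the extra `İ` mapping is an addition (Python's `str.lower()` turns `İ` into `i` followed by a combining dot, U+0307), and it assumes the tokenizer only recognizes the mask token in its exact upper-case form `[MASK]`.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing for dotted and dotless I."""
    # Map the two Turkish-specific capitals before calling lower():
    # "I" must become "ı", and "İ" must become a plain "i".
    return text.replace("I", "ı").replace("İ", "i").lower()


def lower_keep_mask(text: str, mask_token: str = "[MASK]") -> str:
    """Lowercase everything except occurrences of mask_token.

    Assumption: the mask token must stay in its original casing to be
    recognized as a special token.
    """
    parts = text.split(mask_token)
    return mask_token.join(turkish_lower(part) for part in parts)


print(turkish_lower("ISPARTA İli"))
print(lower_keep_mask("Gelirken bir litre [MASK] aldım."))
```

Splitting on the mask token keeps the helper independent of any tokenizer object, at the cost of hard-coding the token string as a default argument.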
+ Be aware that this model may produce biased predictions, as it was trained primarily on crawled data, which can inherently contain various biases.
+
+ Other relevant information can be found in the [paper](https://arxiv.org/abs/2307.14134).
+
+
+ ## Example Usage
+ ```python
+ from transformers import AutoTokenizer, BertForMaskedLM, pipeline
+
+ model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-small-bert-uncased")
+ # or, to load from the TensorFlow checkpoint instead:
+ # model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-small-bert-uncased", from_tf=True)
+
+ tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-small-bert-uncased")
+
+ unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+ unmasker("gelirken bir litre [MASK] aldım.")
+ # The call above returns:
+ [{'score': 0.3692811131477356,
+   'token': 2417,
+   'token_str': 'su',
+   'sequence': 'gelirken bir litre su aldım.'},
+  {'score': 0.2551537752151489,
+   'token': 11818,
+   'token_str': 'benzin',
+   'sequence': 'gelirken bir litre benzin aldım.'},
+  {'score': 0.036265160888433456,
+   'token': 29480,
+   'token_str': 'mazot',
+   'sequence': 'gelirken bir litre mazot aldım.'},
+  {'score': 0.03350532799959183,
+   'token': 4521,
+   'token_str': 'süt',
+   'sequence': 'gelirken bir litre süt aldım.'},
+  {'score': 0.02558029256761074,
+   'token': 2168,
+   'token_str': 'daha',
+   'sequence': 'gelirken bir litre daha aldım.'}]
+ ```
+
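The pipeline output shown above is a list of dictionaries, so it can be post-processed with plain Python. The sketch below is illustrative only and not part of the committed README: `top_tokens` and its `min_score` threshold are hypothetical, and the sample data is the first three predictions copied from the output above so that the snippet runs without downloading the model.

```python
from typing import Dict, List

# First three predictions copied from the example output above.
predictions = [
    {"score": 0.3692811131477356, "token": 2417, "token_str": "su",
     "sequence": "gelirken bir litre su aldım."},
    {"score": 0.2551537752151489, "token": 11818, "token_str": "benzin",
     "sequence": "gelirken bir litre benzin aldım."},
    {"score": 0.036265160888433456, "token": 29480, "token_str": "mazot",
     "sequence": "gelirken bir litre mazot aldım."},
]


def top_tokens(results: List[Dict], min_score: float = 0.1) -> List[str]:
    """Return token strings with score >= min_score, highest score first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["token_str"] for r in ranked if r["score"] >= min_score]


print(top_tokens(predictions))  # ['su', 'benzin']
```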
+
+ # Acknowledgments
+ - Research supported with Cloud TPUs from [Google's TensorFlow Research Cloud](https://sites.research.google/trc/about/) (TFRC). Thanks for providing access to the TFRC ❤️
+ - Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗
+
+ # Citations
+ ```bibtex
+ @article{kesgin2023developing,
+   title={Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models},
+   author={Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
+   journal={arXiv preprint arXiv:2307.14134},
+   year={2023}
+ }
+ ```
+
+ # License
+
+ MIT