Commit 3811073 by Shaltiel (1 parent: cb21837)

Update README.md

  1. README.md +135 -0
README.md CHANGED
@@ -1,3 +1,138 @@
---
license: cc-by-4.0
language:
- he
inference: false
---
# DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

State-of-the-art language model for Hebrew, released [here](https://arxiv.org/abs/2308.16687).

This is the fine-tuned model for the lemmatization task.

For the bert-base models for other tasks, see [here](https://huggingface.co/collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).

## General guidelines for how the lemmatizer works:

Given Hebrew input text, the model attempts to match each word with the correct lexeme from its vocabulary.

- A word that is split into multiple wordpieces is not a problem; the model still predicts its lexeme with high accuracy.

- If the lexeme of a given word does not appear in the vocabulary, the model attempts to predict the special token `[BLANK]`. In that case, the word is usually the name of a person or a city, and its lexeme is probably the word itself with any prefixes removed, which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.

- For verbs, the lexeme is the third-person singular past-tense form.

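The `[BLANK]` case can be handled in a post-processing step. The following is a minimal sketch, not part of the model's API: it assumes that `[BLANK]` appears literally in the lexeme slot of each `[word, lexeme]` pair, and both `resolve_blank` and the `strip_prefixes` callback are hypothetical names. The callback stands in for whatever prefix segmentation you run (for example, with dictabert-seg).

```python
from typing import Callable, List, Optional

# Assumption: the special token shows up as this exact string in the output.
BLANK = '[BLANK]'


def resolve_blank(
    pairs: List[List[str]],
    strip_prefixes: Optional[Callable[[str], str]] = None,
) -> List[List[str]]:
    """Replace [BLANK] lexemes in one sentence of [word, lexeme] pairs.

    `strip_prefixes` is a hypothetical hook for a prefix segmenter.
    Without it, the surface word itself is used as the fallback lexeme.
    """
    resolved = []
    for word, lexeme in pairs:
        if lexeme == BLANK:
            lexeme = strip_prefixes(word) if strip_prefixes else word
        resolved.append([word, lexeme])
    return resolved


# Hand-written illustration, not real model output.
sentence = [['לירושלים', BLANK]]
print(resolve_blank(sentence))
# Toy stand-in for a segmenter: drops the first letter of the word.
print(resolve_blank(sentence, strip_prefixes=lambda w: w[1:]))
```

Keeping the segmenter behind a callback means the fallback logic can be tested without loading a second model.
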
Because this method is purely neural, the predicted lexeme is sometimes not an exact match but a word with a similar meaning. For more accurate results, you can add rules on top of the prediction: look at the top K matches and choose among them with a task-specific set of rules.

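One possible shape for such a rule layer is sketched below. It assumes you already have, for each word, a list of candidate lexemes ordered from most to least likely; how to obtain those top K candidates from the model is not covered here. The names `choose_lexeme` and `lexicon` are hypothetical, and the single rule shown (prefer a candidate found in a trusted word list) is only one example of a rule.

```python
from typing import List, Set


def choose_lexeme(candidates: List[str], lexicon: Set[str]) -> str:
    """Return the highest-ranked candidate that appears in `lexicon`.

    Falls back to the top-ranked candidate when none is in the lexicon.
    """
    if not candidates:
        raise ValueError('candidates must not be empty')
    for candidate in candidates:
        if candidate in lexicon:
            return candidate
    return candidates[0]


# Hand-written candidates for illustration, not real model output.
candidates = ['אמנות', 'אומנות']
lexicon = {'אומנות'}
print(choose_lexeme(candidates, lexicon))
```

Falling back to the top-ranked candidate keeps the behaviour identical to the plain model whenever no rule applies.
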
Sample usage:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
# trust_remote_code=True loads the custom model code (including predict) from the model repository
model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)

model.eval()  # inference mode

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
# predict takes a list of sentences and returns [word, lexeme] pairs for each one
print(model.predict([sentence], tokenizer))
```

Output:
```json
[
  [
    [
      "בשנת",
      "שנה"
    ],
    [
      "1948",
      "1948"
    ],
    [
      "השלים",
      "השלים"
    ],
    [
      "אפרים",
      "אפרים"
    ],
    [
      "קישון",
      "קישון"
    ],
    [
      "את",
      "את"
    ],
    [
      "לימודיו",
      "לימוד"
    ],
    [
      "בפיסול",
      "פיסול"
    ],
    [
      "מתכת",
      "מתכת"
    ],
    [
      "ובתולדות",
      "תולדה"
    ],
    [
      "האמנות",
      "אומנות"
    ],
    [
      "והחל",
      "החל"
    ],
    [
      "לפרסם",
      "פרסם"
    ],
    [
      "מאמרים",
      "מאמר"
    ],
    [
      "הומוריסטיים",
      "הומוריסטי"
    ]
  ]
]
```

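Because each sentence comes back as a list of `[word, lexeme]` pairs, turning the predictions into lemmatized text is a short post-processing step. A minimal sketch, where `lemmatized_text` is a hypothetical helper name and the input is the first three pairs of the sample output:

```python
from typing import List


def lemmatized_text(predictions: List[List[List[str]]]) -> List[str]:
    """Join the lexemes of each sentence into one space-separated string."""
    return [
        ' '.join(lexeme for _word, lexeme in sentence)
        for sentence in predictions
    ]


# First three [word, lexeme] pairs of the sample output.
predictions = [[['בשנת', 'שנה'], ['1948', '1948'], ['השלים', 'השלים']]]
print(lemmatized_text(predictions))
```
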
## Citation

If you use DictaBERT in your research, please cite `DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew`.

**BibTeX:**

```bibtex
@misc{shmidman2023dictabert,
      title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
      year={2023},
      eprint={2308.16687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg