 ---
 license: cc-by-4.0
+language:
+- he
+inference: false
 ---
+# DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
+DictaBERT is a state-of-the-art language model for Modern Hebrew, introduced in [this paper](https://arxiv.org/abs/2308.16687).
+This is the fine-tuned model for the lemmatization task.
+For the bert-base models for other tasks, see [here](https://huggingface.co/collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).
+## General guidelines for how the lemmatizer works:
+Given Hebrew input text, the model attempts to match each word to the correct lexeme in its vocabulary.
+- A word that is split into multiple wordpieces is not a problem; the lexeme is still predicted with high accuracy.
+- If the lexeme of a given word does not appear in the vocabulary, the model will attempt to predict the special token `[BLANK]`. In that case the word is usually the name of a person or a city, and the lexeme is probably the word with its prefixes removed, which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.
+- For verbs, the lexeme is the third-person singular past form.
+The method is purely neural, so the predicted lexeme is occasionally not an exact match but a word from a similar semantic space. For more accurate results, one can apply rules on top of the model that inspect the top K predictions and choose among them.
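The rule-based selection over the top K predictions mentioned above could look roughly like the following. This is a minimal sketch, not part of the model's API: `choose_lexeme` and `shares_letters` are hypothetical helpers, the candidate list is assumed to be already available in descending score order (how to obtain it from the model is not covered here), and the letter-overlap rule is only a toy stand-in for real linguistic rules.

```python
from typing import Callable, Sequence


def shares_letters(word: str, lexeme: str, min_shared: int = 3) -> bool:
    """Toy rule (an assumption, not the authors' rule): accept a lexeme that
    shares enough distinct letters with the surface word."""
    # Short lexemes cannot share more letters than they contain.
    threshold = min(min_shared, len(set(lexeme)))
    return len(set(word) & set(lexeme)) >= threshold


def choose_lexeme(
    word: str,
    candidates: Sequence[str],
    is_acceptable: Callable[[str, str], bool] = shares_letters,
) -> str:
    """Return the first candidate that passes the rule.

    `candidates` is assumed to be the top K lexemes for `word`, best first.
    If no candidate passes, keep the model's top choice; if there are no
    candidates at all, fall back to the surface word.
    """
    if not candidates:
        return word
    for lexeme in candidates:
        if is_acceptable(word, lexeme):
            return lexeme
    return candidates[0]


# Hypothetical candidate list in which a semantically close word outranks the true lexeme.
print(choose_lexeme("לימודיו", ["שיעור", "לימוד"]))
```

Keeping the rule as a plain callable makes it easy to swap the toy letter-overlap check for a lexicon lookup or morphological constraints without touching the selection loop.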
+Sample usage:
+```python
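+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
+# trust_remote_code=True loads the custom model code from the repository, which provides predict()
+model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)
+model.eval()  # inference mode (disables dropout)
+
+sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
+# predict() takes a list of sentences; for each one it returns a list of [word, lexeme] pairs
+print(model.predict([sentence], tokenizer))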
+```
+Output:
+```json
+[
+  [
+    [
+      "בשנת",
+      "שנה"
+    ],
+    [
+      "1948",
+      "1948"
+    ],
+    [
+      "השלים",
+      "השלים"
+    ],
+    [
+      "אפרים",
+      "אפרים"
+    ],
+    [
+      "קישון",
+      "קישון"
+    ],
+    [
+      "את",
+      "את"
+    ],
+    [
+      "לימודיו",
+      "לימוד"
+    ],
+    [
+      "בפיסול",
+      "פיסול"
+    ],
+    [
+      "מתכת",
+      "מתכת"
+    ],
+    [
+      "ובתולדות",
+      "תולדה"
+    ],
+    [
+      "האמנות",
+      "אומנות"
+    ],
+    [
+      "והחל",
+      "החל"
+    ],
+    [
+      "לפרסם",
+      "פרסם"
+    ],
+    [
+      "מאמרים",
+      "מאמר"
+    ],
+    [
+      "הומוריסטיים",
+      "הומוריסטי"
+    ]
+  ]
+]
+```
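The `[BLANK]` fallback described in the guidelines can be sketched as a small post-processing step over output shaped like the JSON above (a list of sentences, each a list of `[word, lexeme]` pairs). This is a sketch under assumptions: `fill_blank_lexemes` is a hypothetical helper, and whether `predict` ever returns the literal string `[BLANK]` rather than substituting a fallback itself should be checked against real output. The `strip_prefixes` argument is a placeholder for a prefix-removal step such as one built on dictabert-seg; by default the surface word is kept unchanged.

```python
from typing import Callable, List

BLANK = "[BLANK]"  # special token named in the guidelines


def fill_blank_lexemes(
    predictions: List[List[List[str]]],
    strip_prefixes: Callable[[str], str] = lambda word: word,
) -> List[List[List[str]]]:
    """Replace every [BLANK] lexeme with a fallback derived from the word.

    `predictions` is assumed to have the shape shown in the sample output:
    one list per sentence, each holding [word, lexeme] pairs.
    """
    filled = []
    for sentence in predictions:
        filled.append([
            [word, strip_prefixes(word) if lexeme == BLANK else lexeme]
            for word, lexeme in sentence
        ])
    return filled


# Hand-written input for illustration; not actual model output.
example = [[["בשנת", "שנה"], ["קישון", BLANK]]]
print(fill_blank_lexemes(example))
```

The function returns a new structure and leaves the input untouched, so the raw predictions remain available for inspection.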
+## Citation
+If you use DictaBERT in your research, please cite `DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew`.
+**BibTeX:**
+```bibtex
+@misc{shmidman2023dictabert,
+      title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
+      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
+      year={2023},
+      eprint={2308.16687},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+## License
+[![CC BY 4.0][cc-by-shield]][cc-by]
+This work is licensed under a
+[Creative Commons Attribution 4.0 International License][cc-by].
+[![CC BY 4.0][cc-by-image]][cc-by]
+[cc-by]: http://creativecommons.org/licenses/by/4.0/
+[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
+[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg