---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---
# Knesset-DictaBERT

**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), which comprises Israeli parliamentary proceedings.

The model is based on [DictaBERT](https://huggingface.co/dicta-il/dictabert) and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.

## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned on the Knesset Corpus using the masked language modeling (MLM) objective: tokens in each sentence are masked at random and the model learns to predict them from the surrounding context, adapting its contextual word representations to parliamentary Hebrew.
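
The exact training hyperparameters are not listed in this card. As an illustration only, a minimal MLM fine-tuning setup with the Hugging Face `Trainer` might look like the sketch below; the text field name (`sentence_text`), masking probability, and training arguments are assumptions, not the values actually used:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from the DictaBERT base model
tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert")
model = AutoModelForMaskedLM.from_pretrained("dicta-il/dictabert")

# Load Knesset Corpus sentences (the field name is an assumption)
dataset = load_dataset("HaifaCLGroup/KnessetCorpus", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence_text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

# Dynamic masking: 15% of tokens per example (the standard BERT setting)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="knesset-dictabert",
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```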

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

sentence = "הכנסת היא הרשות [MASK] של מדינת ישראל."

# Tokenize the input sentence and run the model
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# Locate the [MASK] token instead of hard-coding its position
mask_token_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# Take the two highest-scoring predictions for the masked position
top_2_tokens = torch.topk(output.logits[0, mask_token_index], 2).indices.tolist()

# Convert token IDs to tokens and print them
print("\n".join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: המבצעת / המחוקקת
```
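
The same prediction can also be obtained with the `fill-mask` pipeline:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GiliGold/Knesset-DictaBERT")
for prediction in fill_mask("הכנסת היא הרשות [MASK] של מדינת ישראל.", top_k=2):
    print(prediction["token_str"], round(prediction["score"], 3))
```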

## Evaluation

The model was evaluated on a 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences. Perplexity was calculated on this full test set. Due to time constraints, the accuracy measures were calculated on a subset of approximately 3 million sentences (roughly 520 million tokens).

#### Perplexity

| Model | Perplexity (full test set) |
| --- | --- |
| DictaBERT (original) | 22.87 |
| Knesset-DictaBERT | 6.60 |
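
The card does not spell out the exact perplexity protocol. A common convention for masked language models, assumed in the sketch below, is to report the exponential of the mean MLM loss over randomly masked tokens:

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

# Standard BERT masking: 15% of tokens per sentence (an assumption here)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

def mlm_perplexity(sentences):
    """exp(mean MLM loss) over randomly masked tokens."""
    losses = []
    for sentence in sentences:
        encoding = tokenizer(sentence, truncation=True, max_length=512)
        batch = collator([encoding])  # applies random masking, builds labels
        with torch.no_grad():
            losses.append(model(**batch).loss.item())
    return math.exp(sum(losses) / len(losses))

print(mlm_perplexity(["הכנסת היא הרשות המחוקקת של מדינת ישראל."]))
```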

#### Accuracy

Top-k accuracy counts a masked position as correct when the original token appears among the model's k highest-ranked predictions. Results on the evaluation subset, with a worked sketch of the measurement after the table:

| Model | Top-1 | Top-2 | Top-5 |
| --- | --- | --- | --- |
| DictaBERT (original) | 48.02% | 58.60% | 68.98% |
| Knesset-DictaBERT | 52.55% | 63.07% | 73.59% |
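
The evaluation script itself is not included in this card; a minimal sketch of such a top-k measurement, assuming one forward pass per masked position, could look like this:

```python
import torch

def top_k_accuracy(model, tokenizer, sentences, k=5):
    """Fraction of masked positions whose true token is in the top-k
    predictions, masking each token in turn."""
    hits, total = 0, 0
    for sentence in sentences:
        input_ids = tokenizer(sentence, truncation=True, max_length=512,
                              return_tensors="pt").input_ids
        for pos in range(1, input_ids.size(1) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            true_id = masked[0, pos].item()
            masked[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked).logits
            top_k = torch.topk(logits[0, pos], k).indices
            hits += int(true_id in top_k)
            total += 1
    return hits / total
```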

## Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{Knesset-DictaBERT,
  author = {Gili Goldin},
  title = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}
```