---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---
# Knesset-DictaBERT

**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), which comprises Israeli parliamentary proceedings.

The model is based on [DictaBERT](https://huggingface.co/dicta-il/dictabert) and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.

## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned on the Knesset Corpus using the masked language modeling (MLM) objective: tokens in each sentence are masked at random and the model learns to predict them from the surrounding context, adapting its contextual word representations to parliamentary Hebrew.
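
The exact training hyperparameters are not listed in this card. As an illustration only, a minimal MLM fine-tuning setup with the Hugging Face `Trainer` might look like the sketch below; the text field name (`sentence_text`), masking probability, and training arguments are assumptions, not the values actually used:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from the DictaBERT base model
tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert")
model = AutoModelForMaskedLM.from_pretrained("dicta-il/dictabert")

# Load Knesset Corpus sentences (the field name is an assumption)
dataset = load_dataset("HaifaCLGroup/KnessetCorpus", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence_text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

# Dynamic masking: 15% of tokens per example (the standard BERT setting)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="knesset-dictabert",
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```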

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

sentence = "הכנסת היא הרשות [MASK] של מדינת ישראל."

# Tokenize the input sentence and run the model
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# Locate the [MASK] token instead of hard-coding its position
mask_token_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# Take the two highest-scoring predictions for the masked position
top_2_tokens = torch.topk(output.logits[0, mask_token_index], 2).indices.tolist()

# Convert token IDs to tokens and print them
print("\n".join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: המבצעת / המחוקקת
```
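
The same prediction can also be obtained with the `fill-mask` pipeline:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GiliGold/Knesset-DictaBERT")
for prediction in fill_mask("הכנסת היא הרשות [MASK] של מדינת ישראל.", top_k=2):
    print(prediction["token_str"], round(prediction["score"], 3))
```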

## Evaluation

The model was evaluated on a 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences. Perplexity was calculated on this full test set. Due to time constraints, the accuracy measures were calculated on a subset of approximately 3 million sentences (roughly 520 million tokens).

#### Perplexity

| Model | Perplexity (full test set) |
| --- | --- |
| DictaBERT (original) | 22.87 |
| Knesset-DictaBERT | 6.60 |
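
The card does not spell out the exact perplexity protocol. A common convention for masked language models, assumed in the sketch below, is to report the exponential of the mean MLM loss over randomly masked tokens:

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

# Standard BERT masking: 15% of tokens per sentence (an assumption here)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

def mlm_perplexity(sentences):
    """exp(mean MLM loss) over randomly masked tokens."""
    losses = []
    for sentence in sentences:
        encoding = tokenizer(sentence, truncation=True, max_length=512)
        batch = collator([encoding])  # applies random masking, builds labels
        with torch.no_grad():
            losses.append(model(**batch).loss.item())
    return math.exp(sum(losses) / len(losses))

print(mlm_perplexity(["הכנסת היא הרשות המחוקקת של מדינת ישראל."]))
```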

#### Accuracy

Top-k accuracy counts a masked position as correct when the original token appears among the model's k highest-ranked predictions. Results on the evaluation subset, with a worked sketch of the measurement after the table:

| Model | Top-1 | Top-2 | Top-5 |
| --- | --- | --- | --- |
| DictaBERT (original) | 48.02% | 58.60% | 68.98% |
| Knesset-DictaBERT | 52.55% | 63.07% | 73.59% |
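
The evaluation script itself is not included in this card; a minimal sketch of such a top-k measurement, assuming one forward pass per masked position, could look like this:

```python
import torch

def top_k_accuracy(model, tokenizer, sentences, k=5):
    """Fraction of masked positions whose true token is in the top-k
    predictions, masking each token in turn."""
    hits, total = 0, 0
    for sentence in sentences:
        input_ids = tokenizer(sentence, truncation=True, max_length=512,
                              return_tensors="pt").input_ids
        for pos in range(1, input_ids.size(1) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            true_id = masked[0, pos].item()
            masked[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked).logits
            top_k = torch.topk(logits[0, pos], k).indices
            hits += int(true_id in top_k)
            total += 1
    return hits / total
```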

## Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{Knesset-DictaBERT,
  author = {Gili Goldin},
  title = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}
```