File size: 3,600 Bytes
091e64e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning

---
# Knesset-DictaBERT
**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), 
which comprises Israeli parliamentary proceedings. 

This model is based on the [Dicta-BERT](https://huggingface.co/dicta-il/dictabert) architecture 
and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.


## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [Dicta-BERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.

## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("your-username/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("your-username/Knesset-DictaBERT")

model.eval()

sentence = "讛讻谞住转 讛讬讗 讛专砖讜转 [MASK] 砖诇 诪讚讬谞转 讬砖专讗诇."

# Tokenize the input sentence and get predictions
inputs = tokenizer.encode(sentence, return_tensors='pt')
output = model(inputs)

# The [MASK] token is the 5th token in the sentence (including [CLS])
mask_token_index = 5
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2)[1]

# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))

# Example output: 讛诪讘爪注转 / 讛诪讞讜拽拽转

```

## Evaluation
The evaluation was conducted on a 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences.
The perplexity was calculated on this full test set.
Due to time constraints, accuracy measures were calculated on a subset of this test set, consisting of approximately 3 million sentences (approximately 520 million tokens).

#### Perplexity
The perplexity of the original DictaBERT on the full test set is 22.87.
The perplexity of Knesset-DictaBERT on the full test set is 6.60.
#### Accuracy
- **1-accuracy results**

Knesset-DictaBERT identified the correct token in the top-1 prediction in 52.55% of the cases.
The original DictaBERT model achieved a top-1 accuracy of 48.02%.


- **2-accuracy results**

Knesset-DictaBERT identified the correct token within the top-2 predictions in 63.07% of the cases.
The original Dicta model achieved a top-2 accuracy of 58.60%.


- **5-accuracy results**
Knesset-DictaBERT identified the correct token within the top-5 predictions in 73.59% of the cases.
The original Dicta model achieved a top-5 accuracy of 68.98%.

## Acknowledgments
This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation
If you use this model in your work, please cite:

@misc{Knesset-DictaBERT,
  author = {Gili Goldin},
  title = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}