---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---
# Knesset-DictaBERT
**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus),
which comprises Israeli parliamentary proceedings.
This model is based on the [DictaBERT](https://huggingface.co/dicta-il/dictabert) architecture
and is designed to model Hebrew text through masked-token prediction, with a specific focus on parliamentary language and context.
## Model Details
- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)
## Training Procedure
The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
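The masking step described above can be illustrated with a minimal sketch. This is not the training code used for this model: the `mask_tokens` helper, the 15% default masking rate, and the whitespace tokenization are assumptions made for illustration, and the sketch always substitutes `[MASK]` rather than following any particular replacement scheme.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, matching the usage example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one MLM training example.

    labels[i] holds the original token at masked positions and None
    elsewhere, so a loss would be computed only where a token was hidden.
    The default mask_prob is an assumption, not a value from this model card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical example sentence, split on whitespace for simplicity
tokens = "the committee will meet next week".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored on how well it recovers the non-`None` entries of `labels`.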
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()
sentence = "יש לנו [MASK] על זה בשבוע הבא"
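# Tokenize the input sentence and get predictions
inputs = tokenizer.encode(sentence, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    output = model(inputs)
# Locate the [MASK] position instead of hard-coding its index
mask_token_index = (inputs[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2)[1]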
# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: ישיבה / דיון
```
## Evaluation
Evaluation was conducted on a test set comprising 10% of the Knesset Corpus, approximately 3.2 million sentences.
Perplexity was calculated on the full test set.
Due to time constraints, accuracy was calculated on a subset of about 300,000 sentences (approximately 3.5 million tokens).
#### Perplexity
The perplexity of the original DictaBERT on the full test set is 22.87.
The perplexity of Knesset-DictaBERT on the full test set is 6.60.
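For readers unfamiliar with the metric, the sketch below uses the standard definition of perplexity as the exponential of the mean negative log-probability assigned to the correct tokens. The model card does not specify the exact evaluation procedure, so the `perplexity` helper and the example probabilities are hypothetical and only illustrate the formula.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the correct tokens."""
    if not token_log_probs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities a model assigned to three correct tokens
example = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(example))  # approximately 4.0
```

Lower values mean the model assigns higher probability to the correct tokens; a model that always chose uniformly among N candidates would score a perplexity of N.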
#### Accuracy
- **Top-1 accuracy**
  Knesset-DictaBERT identified the correct token as its top-1 prediction in 52.55% of the cases.
  The original DictaBERT model achieved a top-1 accuracy of 48.02%.
- **Top-2 accuracy**
  Knesset-DictaBERT identified the correct token within its top-2 predictions in 63.07% of the cases.
  The original DictaBERT model achieved a top-2 accuracy of 58.60%.
- **Top-5 accuracy**
  Knesset-DictaBERT identified the correct token within its top-5 predictions in 73.59% of the cases.
  The original DictaBERT model achieved a top-5 accuracy of 68.98%.
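The top-k accuracy figures above count a prediction as correct when the true token appears among the model's k highest-ranked candidates for a masked position. A minimal sketch of that computation follows; the `top_k_accuracy` helper and the toy data are hypothetical and are not the evaluation script used to produce the reported numbers.

```python
def top_k_accuracy(ranked_predictions, gold_tokens, k):
    """Fraction of masked positions whose gold token is among the first k predictions.

    ranked_predictions[i] is a list of candidate tokens for position i,
    ordered from most to least likely.
    """
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("predictions and gold tokens must have the same length")
    if not gold_tokens:
        raise ValueError("need at least one masked position")
    hits = sum(
        gold in preds[:k]
        for preds, gold in zip(ranked_predictions, gold_tokens)
    )
    return hits / len(gold_tokens)


# Hypothetical toy data: four masked positions, three ranked candidates each
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
gold = ["a", "a", "a", "d"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, gold, k))
```

By construction, top-k accuracy can only stay the same or increase as k grows, which matches the ordering of the top-1, top-2, and top-5 results reported above.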
## Acknowledgments
This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.
## Citation
If you use this model in your work, please cite:
```bibtex
@misc{goldin2024knessetdictaberthebrewlanguagemodel,
title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
author={Gili Goldin and Shuly Wintner},
year={2024},
eprint={2407.20581},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.20581},
}
```