license: cc-by-sa-4.0
datasets:
  - HaifaCLGroup/KnessetCorpus
language:
  - he
tags:
  - hebrew
  - nlp
  - masked-language-model
  - transformers
  - BERT
  - parliamentary-proceedings
  - language-model
  - Knesset
  - DictaBERT
  - fine-tuning

Knesset-DictaBERT

Knesset-DictaBERT is a Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings.

This model is based on the DictaBERT architecture and is designed to model Hebrew text, with a specific focus on parliamentary language and context. As a BERT-style masked language model, it predicts masked tokens in context rather than generating free-form text.

Model Details

  • Model type: BERT-based (Bidirectional Encoder Representations from Transformers)
  • Language: Hebrew
  • Training Data: Knesset Corpus (Israeli parliamentary proceedings)
  • Base Model: DictaBERT

Training Procedure

The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
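The masking step of MLM can be sketched in plain Python. This is a minimal illustration, not the training code used for this model: the 15% selection rate and the 80/10/10 replacement split follow the original BERT recipe and are assumptions here, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK_TOKEN = "[MASK]"  # BERT's mask symbol


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token at positions the model must predict
    and None elsewhere. mask_prob and the 80/10/10 split are assumptions
    taken from the original BERT recipe, not from this model card.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels


# Toy example on whitespace tokens; a real setup would use subword token IDs
tokens = "we have a discussion on this next week".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.3, seed=0)
print(corrupted)
print(labels)
```

During fine-tuning, the loss is computed only at positions where a label is present, so the model learns to recover the original token from its surrounding context.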

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()
# English: "We have a [MASK] on this next week"
sentence = "יש לנו [MASK] על זה בשבוע הבא"

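# Tokenize the input sentence and get predictions (no gradients needed for inference)
inputs = tokenizer.encode(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(inputs)

# Locate the [MASK] position instead of hard-coding it
mask_token_index = (inputs[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices

# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens.tolist())))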

# Example output: ישיבה / דיון (English: "meeting" / "discussion")

Evaluation

The evaluation was conducted on a held-out test set comprising 10% of the Knesset Corpus, approximately 3.2 million sentences. Perplexity was calculated on this full test set. Due to time constraints, accuracy measures were calculated on a subset of the test set containing approximately 300,000 sentences (about 3.5 million tokens).

Perplexity

The perplexity of the original DictaBERT on the full test set is 22.87.

The perplexity of Knesset-DictaBERT on the full test set is 6.60.
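For readers unfamiliar with the metric, the sketch below shows one common way to compute perplexity: the exponential of the mean negative log-likelihood that the model assigns to the correct tokens. The model card does not state the exact procedure used for the figures above, so this definition is an assumption, and the probabilities are made-up illustrative values.

```python
import math


def perplexity(gold_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the gold tokens)."""
    if not gold_token_probs:
        raise ValueError("need at least one prediction")
    nll = [-math.log(p) for p in gold_token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities a model assigned to the correct token
# at four masked positions (illustrative values, not real model output)
probs = [0.5, 0.25, 0.125, 0.25]
print(perplexity(probs))  # approximately 4
```

Lower is better: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among four equally likely tokens.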

Accuracy

  • 1-accuracy results

Knesset-DictaBERT identified the correct token in the top-1 prediction in 52.55% of the cases.

The original DictaBERT model achieved a top-1 accuracy of 48.02%.

  • 2-accuracy results

Knesset-DictaBERT identified the correct token within the top-2 predictions in 63.07% of the cases.

The original DictaBERT model achieved a top-2 accuracy of 58.60%.

  • 5-accuracy results

Knesset-DictaBERT identified the correct token within the top-5 predictions in 73.59% of the cases.

The original DictaBERT model achieved a top-5 accuracy of 68.98%.
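The top-k accuracy figures above count a prediction as correct when the true token appears among the model's k highest-scoring candidates. A minimal sketch of that computation follows; `top_k_accuracy` is a hypothetical helper, and the scores are toy values over a four-token vocabulary rather than real model output.

```python
def top_k_accuracy(scores, gold_ids, k):
    """Fraction of positions whose gold id is among the k highest-scoring ids."""
    if len(scores) != len(gold_ids):
        raise ValueError("scores and gold_ids must have the same length")
    hits = 0
    for row, gold in zip(scores, gold_ids):
        # Indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += gold in top_k
    return hits / len(gold_ids)


# Hypothetical scores over a 4-token vocabulary at three masked positions
scores = [
    [0.1, 0.6, 0.2, 0.1],  # gold id 1 is ranked first
    [0.4, 0.1, 0.3, 0.2],  # gold id 2 is ranked second
    [0.3, 0.3, 0.1, 0.4],  # gold id 2 is ranked last
]
gold_ids = [1, 2, 2]
for k in (1, 2, 5):
    print(k, top_k_accuracy(scores, gold_ids, k))
```

Because the top-k set only grows with k, top-k accuracy can never decrease as k increases, which matches the ordering of the reported top-1, top-2, and top-5 results.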

Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

Citation

If you use this model in your work, please cite:

@misc{goldin2024knessetdictaberthebrewlanguagemodel,
      title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings}, 
      author={Gili Goldin and Shuly Wintner},
      year={2024},
      eprint={2407.20581},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.20581}, 
}