QARiB: QCRI Arabic and Dialectal BERT
About QARiB Farasa
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected through the Twitter API using the language filter lang:ar. The text data is a combination of Arabic GigaWord, Abulkhair Arabic Corpus and OPUS.
QARiB is the Arabic word for "boat".
Model and Parameters:
- Data size: 14B tokens
- Vocabulary: 64k
- Iterations: 10M
- Number of Layers: 12
Training QARiB
See details in Training QARiB
Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see Using QARiB
This model expects segmented input. You can segment your text with the Farasa Segmenter API.
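Because the model works on segmented text, its inputs and predicted sequences carry segment markers. The sketch below shows one way to turn such text back into a readable surface form. It assumes that "+" marks segment boundaries, as in the examples in this card; the `desegment` helper is hypothetical, is not part of Farasa or of this model, and does not undo any orthographic changes a real segmenter may apply.

```python
def desegment(text: str) -> str:
    """Rejoin segmented words by dropping '+' boundary markers.

    Assumption: '+' separates the segments of a word, as in the
    examples in this card (e.g. a conjunction or article prefix).
    """
    words = []
    for token in text.split():
        # A standalone "+" is kept as a literal plus sign,
        # since it is not attached to any word segment.
        words.append(token if token == "+" else token.replace("+", ""))
    return " ".join(words)


if __name__ == "__main__":
    print(desegment("و+قام ال+مدير [MASK]"))
```

This only covers the reverse direction; producing the segmented input still requires a segmenter such as Farasa.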
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/bert-base-qarib_far")
>>> fill_mask("و+قام ال+مدير [MASK]")
>>> fill_mask("و+قام+ت ال+مدير+ة [MASK]")
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
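The calls above return a list of candidate fillers, each a dictionary with fields such as `score` and `token_str`. As a minimal sketch, the snippet below loads the model by its Hub id (taken from the download link in this card, instead of the local path used above) and reduces the output to the top few candidates. The helper names `load_fill_mask` and `top_predictions` are illustrative and not part of the model or the `transformers` library.

```python
from typing import Dict, List, Tuple

# Hub id taken from the download link in this card.
MODEL_ID = "qarib/bert-base-qarib_far"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; the weights are downloaded on first use."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Usage (requires the transformers package and network access):
# fill_mask = load_fill_mask()
# print(top_predictions(fill_mask("و+قام ال+مدير [MASK]")))
```

Keeping the ranking step separate from the model call makes it easy to inspect or post-process predictions without reloading the model.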
Evaluations:
Model Weights and Vocab Download
From Huggingface site: https://huggingface.co/qarib/bert-base-qarib_far
Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
Reference
@article{abdelali2021pretraining,
title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
year={2021},
eprint={2102.10684},
archivePrefix={arXiv},
primaryClass={cs.CL}
}