---
language: ar
---

# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>

**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT paper](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).

There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words. The training corpora are a collection of publicly available large-scale raw Arabic text ([Arabic Wikidumps](https://archive.org/details/arwiki-20190201), [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4), [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619), Assafir news articles, and 4 other manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, AL-Wafd) from [the Wayback Machine](http://web.archive.org/)).

We evaluate both AraBERT models on different downstream tasks and compare them to [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other state-of-the-art models (*to the best of our knowledge*). The tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL).

**Update 2 (21/5/2020):**
Added support for the [farasapy](https://github.com/MagedSaeed/farasapy) segmenter in ``preprocess_arabert.py``, which is ~6x faster than the ``py4j.java_gateway`` approach. Consider setting ``use_farasapy=True`` when calling ``preprocess`` and passing it an instance of ``FarasaSegmenter(interactive=True)``, with ``interactive`` set to ``True`` for faster segmentation. A quick example is shown below.
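
For quick reference, here is the call described above in isolation; everything in it comes from the full pipeline shown under "How to use" below (tokenizer setup included there):
```python
from farasa.segmenter import FarasaSegmenter
from arabert.preprocess_arabert import preprocess

# interactive=True is the faster mode recommended in Update 2
farasa_segmenter = FarasaSegmenter(interactive=True)

text_preprocessed = preprocess("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري",
                               do_farasa_tokenization=True,
                               farasa=farasa_segmenter,
                               use_farasapy=True)
```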

**Update 1 (21/4/2020):**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change in ```answer_start``` during preprocessing.
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83.6|93.1|**93.8**
LABR|**87.5** [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM: 34.2, F1: 61.3|EM: 51.14, F1: 82.13|**EM: 54.84, F1: 82.15**

*If you tested AraBERT on a public dataset and want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference.*
## How to use

You can easily use AraBERT, since it is almost fully compatible with existing codebases (use the [AraBERT repo](https://github.com/aub-mind/arabert) instead of the official BERT one; the only difference is in the ```tokenization.py``` file, where we modify the ```_is_punctuation``` function to make it compatible with the "+" symbol and the "[" and "]" characters). A sketch of that change is shown below.
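
For illustration, the gist of that change might look like the following (a sketch of the idea only; see the repo's ```tokenization.py``` for the actual code):
```python
# Sketch only: "+", "[" and "]" carry segmentation markers and special tokens,
# so they must NOT be treated as punctuation, otherwise the basic tokenizer
# would split tokens such as "و+" or "+نا".
import unicodedata

def _is_punctuation(char):
    """Checks whether `char` is punctuation, keeping '+', '[' and ']' intact."""
    if char in ("+", "[", "]"):
        return False
    cp = ord(char)
    # ASCII symbol ranges treated as punctuation, as in the original BERT code
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(char).startswith("P")
```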
To use HuggingFace's Transformers repository you only need to provide a list of tokens that the model should never split, and make sure that the text is pre-segmented:
**Not all libraries built on top of Transformers support the `never_split` argument.**
```python
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter

arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
    do_lower_case=False,
    do_basic_tokenize=True,
    never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

# Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)

# or you can use a py4j JavaGateway to the Farasa segmenter .jar, but it's slower
# (see Update 2)
# from py4j.java_gateway import JavaGateway
# gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
# farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess(text,
                               do_farasa_tokenization=True,
                               farasa=farasa_segmenter,
                               use_farasapy=True)

>>> text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"

arabert_tokenizer.tokenize(text_preprocessed)

>>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```

**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer, AutoModel

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01", do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)

>>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```

The ```araBERT_(Updated_Demo_TF).ipynb``` notebook is a small demo on the AJGT dataset using TensorFlow (GPU and TPU compatible).

**Coming Soon:** Fine-tuning demo using HuggingFace's Trainer API (an illustrative sketch is given below).
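
Until that demo is released, here is a minimal sketch of what fine-tuning for sentiment classification with the Trainer API could look like. The dataset, labels, and output paths below are placeholders of ours, not part of the AraBERT repo:
```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01")
model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv01", num_labels=2)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Placeholder data: replace with a real sentiment dataset such as AJGT or HARD
train_dataset = SentimentDataset(["أحببت هذا المنتج", "تجربة سيئة جدا"], [1, 0])

training_args = TrainingArguments(
    output_dir="./arabert_sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```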

**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use ```arcd_preprocessing.py```, as shown below, to clean and preprocess the ARCD dataset before running ```run_squad.py```. A more detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
    --input_file="/PATH_TO/arcd-test.json" \
    --output_file="arcd-test-pre.json" \
    --do_farasa_tokenization=True \
    --use_farasapy=True
```
```bash
python SOQAL/bert/run_squad.py \
    --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
    --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
    --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
    --do_train=True \
    --train_file=turk_combined_all_pre.json \
    --do_predict=True \
    --predict_file=arcd-test-pre.json \
    --train_batch_size=32 \
    --predict_batch_size=24 \
    --learning_rate=3e-5 \
    --num_train_epochs=4 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --do_lower_case=False \
    --output_dir="/PATH_TO/OUTPUT_PATH/" \
    --use_tpu=True \
    --tpu_name=$TPU_ADDRESS
```
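
For intuition, the core of the ```answer_start``` fix is to re-locate the preprocessed answer inside the preprocessed context. The sketch below is our own illustration of that idea, not the actual code in ```arcd_preprocessing.py```; it reuses the ```preprocess``` call shown earlier:
```python
# Illustration only: recompute answer_start after preprocessing by finding
# the preprocessed answer inside the preprocessed context.
from arabert.preprocess_arabert import preprocess
from farasa.segmenter import FarasaSegmenter

farasa_segmenter = FarasaSegmenter(interactive=True)

def fix_answer_start(context, answer_text):
    """Preprocess context and answer, then recompute the answer's start offset."""
    context_pre = preprocess(context,
                             do_farasa_tokenization=True,
                             farasa=farasa_segmenter,
                             use_farasapy=True)
    answer_pre = preprocess(answer_text,
                            do_farasa_tokenization=True,
                            farasa=farasa_segmenter,
                            use_farasapy=True)
    new_start = context_pre.find(answer_pre)  # -1 means the answer could not be re-located
    return context_pre, answer_pre, new_start
```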
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
TensorFlow|[Drive Link](https://drive.google.com/open?id=1-kVmTUZZ4DP2rzeHNjTPkY8OjnQCpomO) | [Drive Link](https://drive.google.com/open?id=1-d7-9ljKgDJP5mx73uBtio-TuUZCqZnt)
PyTorch| [Drive Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7yaJH-IOv)| [Drive Link](https://drive.google.com/open?id=1-69s6Pxqbi63HOQ1M9wTcr-Ovc6PWLLo)

**You can find the PyTorch models in HuggingFace's Transformers library under the ```aubmindlab``` username**

## If you used this model please cite us as:
```
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support, to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.

## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>