Upload 7 files
Trained on Arabic SQuADv2
- README.md +94 -0
- config.json +31 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +20 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,94 @@
---
datasets:
- ZeyadAhmed/Arabic-SQuADv2.0
language:
- ar
metrics:
- name: exact_match
  type: exact_match
  value: 65.12
- name: F1
  type: f1
  value: 71.49
---

# AraElectra for Question Answering on Arabic-SQuADv2

This is the [AraElectra](https://huggingface.co/aubmindlab/araelectra-base-discriminator) model, fine-tuned on the [Arabic-SQuADv2.0](https://huggingface.co/datasets/ZeyadAhmed/Arabic-SQuADv2.0) dataset for the task of Question Answering. It was trained on question-answer pairs, including unanswerable questions, and is used together with the [AraElectra Classifier](https://huggingface.co/ZeyadAhmed/AraElectra-Arabic-SQuADv2-CLS), which predicts whether a question is unanswerable.

## Overview
**Language model:** AraElectra <br>
**Language:** Arabic <br>
**Downstream task:** Extractive QA <br>
**Training data:** Arabic-SQuADv2.0 <br>
**Eval data:** Arabic-SQuADv2.0 <br>
**Test data:** Arabic-SQuADv2.0 <br>
**Code:** [See more info on GitHub](https://github.com/zeyadahmed10/Arabic-MRC) <br>
**Infrastructure:** 1x Tesla K80

## Hyperparameters

```
batch_size = 8
n_epochs = 4
base_LM_model = "AraElectra"
learning_rate = 3e-5
optimizer = AdamW
padding = dynamic
```
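
These values map onto a standard Hugging Face fine-tuning setup. The snippet below is only a minimal sketch of what such a setup could look like, not the actual training code (which lives in the GitHub repository linked above); `tokenized_train` is a hypothetical placeholder for an already-tokenized Arabic-SQuADv2.0 split.

```python
# Minimal sketch of a fine-tuning setup matching the hyperparameters above.
# `tokenized_train` is a hypothetical, already-tokenized Arabic-SQuADv2.0 split;
# see the GitHub repository for the actual training code.
from transformers import (
    AutoTokenizer,
    ElectraForQuestionAnswering,
    Trainer,
    TrainingArguments,
)

base_model = "aubmindlab/araelectra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = ElectraForQuestionAnswering.from_pretrained(base_model)

args = TrainingArguments(
    output_dir="araelectra-arabic-squadv2-qa",
    per_device_train_batch_size=8,  # batch_size = 8
    num_train_epochs=4,             # n_epochs = 4
    learning_rate=3e-5,             # learning_rate = 3e-5; Trainer uses AdamW by default
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # hypothetical tokenized dataset
    tokenizer=tokenizer,            # with a tokenizer set, Trainer pads dynamically per batch
)
trainer.train()
```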

## Online Demo on Arabic Wikipedia and User Provided Contexts

See the model in action, hosted on Streamlit [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/wissamantoun/arabic-wikipedia-qa-streamlit/main)

## Usage
For best results, use the AraBert [preprocessor](https://github.com/aub-mind/arabert/blob/master/preprocess.py) by aub-mind.
```python
from transformers import ElectraForQuestionAnswering, ElectraForSequenceClassification, AutoTokenizer, pipeline
from preprocess import ArabertPreprocessor

prep_object = ArabertPreprocessor("araelectra-base-discriminator")
question = prep_object('ما هي جامعة الدول العربية ؟')
context = prep_object('''
جامعة الدول العربية هي منظمة إقليمية تضم دولاً عربية في آسيا وأفريقيا.
ينص ميثاقها على التنسيق بين الدول الأعضاء في الشؤون الاقتصادية، ومن ضمنها العلاقات التجارية الاتصالات، العلاقات الثقافية، الجنسيات ووثائق وأذونات السفر والعلاقات الاجتماعية والصحة. المقر الدائم لجامعة الدول العربية يقع في القاهرة، عاصمة مصر (تونس من 1979 إلى 1990).
''')

# a) Get predictions
qa_modelname = 'ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA'
cls_modelname = 'ZeyadAhmed/AraElectra-Arabic-SQuADv2-CLS'
qa_pipe = pipeline('question-answering', model=qa_modelname, tokenizer=qa_modelname)
cls_pipe = pipeline('text-classification', model=cls_modelname, tokenizer=cls_modelname)
QA_input = {
    'question': question,
    'context': context
}
CLS_input = {
    'text': question,
    'text_pair': context
}
qa_res = qa_pipe(QA_input)
cls_res = cls_pipe(CLS_input)
threshold = 0.5  # hyperparameter, can be tweaked
# Note: in the classification result, label 0 is the probability that the question can be
# answered and label 1 the probability that it cannot. If the label-1 probability exceeds
# the threshold, treat the answer as an empty string; otherwise keep the answer from qa_res.

# b) Load model & tokenizer
qa_model = ElectraForQuestionAnswering.from_pretrained(qa_modelname)
cls_model = ElectraForSequenceClassification.from_pretrained(cls_modelname)
tokenizer = AutoTokenizer.from_pretrained(qa_modelname)
```
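
The gating step described in the comments above can be made explicit. The following is a minimal sketch (not part of the original snippet) that assumes `qa_res`, `cls_res`, and `threshold` from the code above, and that the classifier reports its label-1 score as the probability that the question cannot be answered:

```python
# Sketch: use the classifier output to decide whether to keep the extracted answer.
# Assumes qa_res, cls_res and threshold from the snippet above; the exact output
# format of the classification pipeline (dict vs. single-element list) may vary.
def gate_answer(qa_res, cls_res, threshold=0.5):
    res = cls_res[0] if isinstance(cls_res, list) else cls_res
    # LABEL_1 = probability that the question cannot be answered
    no_answer_prob = res["score"] if res["label"] == "LABEL_1" else 1.0 - res["score"]
    return "" if no_answer_prob > threshold else qa_res["answer"]

print(gate_answer(qa_res, cls_res, threshold))
```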

## Performance
Evaluated on the Arabic-SQuADv2.0 test set with the [official SQuAD v2 eval script](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/), with its preprocessing slightly adapted to Arabic ([modified eval script](https://github.com/zeyadahmed10/Arabic-MRC/blob/main/evaluatev2.py)).

```
"exact": 65.11555277951281,
"f1": 71.49042547237256,
"total": 9606,
"HasAns_exact": 56.14535768645358,
"HasAns_f1": 67.79623803036668,
"HasAns_total": 5256,
"NoAns_exact": 75.95402298850574,
"NoAns_f1": 75.95402298850574,
"NoAns_total": 4350
```
config.json
ADDED
@@ -0,0 +1,31 @@
{
  "_name_or_path": "/content/drive/MyDrive/AraElectra-ASQuADv2-QA",
  "architectures": [
    "ElectraForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "generator_hidden_size": 0.33333,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:26c01a5524d2fcc36a627877a8ff9f3f02d5ad00b334a7f387b5def64dd47cff
size 538485425
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,20 @@
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_len": 512,
  "name_or_path": "/content/drive/MyDrive/AraElectra-ASQuADv2-QA",
  "never_split": [
    "[بريد]",
    "[مستخدم]",
    "[رابط]"
  ],
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "ElectraTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff