---
language: fr
datasets:
- piaf
- FQuAD
- SQuAD-FR
---

# camembert-base-squadFR-fquad-piaf

## Description

French [DPR model](https://arxiv.org/abs/2004.04906) using [CamemBERT](https://arxiv.org/abs/1911.03894) as its base, then fine-tuned on a combination of three French Q&A datasets.

## Data

### French Q&A

We use a combination of three French Q&A datasets:

1. [PIAFv1.1](https://www.data.gouv.fr/en/datasets/piaf-le-dataset-francophone-de-questions-reponses/)
2. [FQuADv1.0](https://fquad.illuin.tech/)
3. [SQuAD-FR (SQuAD automatically translated to French)](https://github.com/Alikabbadj/French-SQuAD)

### Training

We use 90 562 random questions for `train` and 22 391 for `dev`. No question in `train` appears in `dev`. For each question, we have a single `positive_context` (the paragraph where the answer to this question is found) and around 30 `hard_negative_contexts`. Hard negative contexts are found by querying an Elasticsearch instance (via BM25 retrieval) and keeping the top-k candidates **that do not contain the answer**, as sketched below.
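
A minimal sketch of that mining step, using the `elasticsearch` Python client (the index name, field names, and helper functions here are hypothetical; only the "BM25 top-k without the answer" logic follows the description above):

```python
from elasticsearch import Elasticsearch  # BM25 is Elasticsearch's default ranking

es = Elasticsearch()  # assumes a local instance with the French paragraphs indexed

def mine_hard_negatives(question, answer, k=30, candidates=100):
    """Return the top-k BM25 candidates that do NOT contain the answer string."""
    hits = es.search(
        index="fr_paragraphs",  # hypothetical index name
        body={"query": {"match": {"text": question}}, "size": candidates},
    )["hits"]["hits"]
    contexts = [hit["_source"] for hit in hits]
    return [ctx for ctx in contexts if answer not in ctx["text"]][:k]

def to_dpr_record(question, answer, positive_ctx):
    """One training record per question, shaped as the official DPR repo expects."""
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [positive_ctx],
        "negative_ctxs": [],
        "hard_negative_ctxs": mine_hard_negatives(question, answer),
    }
```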

The corresponding files are here:

### Evaluation

We use the FQuADv1.0 and SQuAD-FR evaluation sets.

## Training Script

We use the official [Facebook DPR implementation](https://github.com/facebookresearch/DPR) with a slight modification: by default, the code works with RoBERTa models, but we changed a single line to make it easier to use with CamemBERT. This modification can be found [over here](dpr fork).

### Hyperparameters

```shell
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
--max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
--seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
--train_file DPR_FR_train.json \
--dev_file ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
--output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
--dev_batch_size 16 --val_av_rank_start_epoch 25 \
--pretrained_model_cfg ./data/bert-base-multilingual-uncased
```

## Evaluation results

We obtain the following results on the FQuAD and SQuAD-FR evaluation (or validation) sets. To compute them, we use [haystack's evaluation script](https://github.com/deepset-ai/haystack/blob/db4151bbc026f27c6d709fefef1088cd3f1e18b9/tutorials/Tutorial5_Evaluation.py) (**we report Retrieval results only**).
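
A condensed view of the retrieval part of that script (the index names and returned keys follow haystack's defaults at that commit and are assumptions here):

```python
# Retrieval-only evaluation: recall@k and mean average precision
eval_results = retriever.eval(
    label_index="label",        # assumed index holding the Q&A labels
    doc_index="eval_document",  # assumed index holding the passages
    top_k=20,
)
print(eval_results["recall"], eval_results["map"])
```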

### DPR

#### FQuAD v1.0 Evaluation

```shell
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57
```

#### SQuAD-FR Evaluation

```shell
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
```

### BM25

For reference, BM25 obtains the results shown below. As in the original paper, DPR is consistently outperformed by BM25 on SQuAD-like datasets.

#### FQuAD v1.0 Evaluation

```shell
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74
```

#### SQuAD-FR Evaluation

```shell
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
```

## Usage

The results reported here are obtained with the `haystack` library. To get similar embeddings using only the HF `transformers` library, you can do the following:

```python
from transformers import AutoTokenizer, AutoModel

query = "Salut, mon chien est-il mignon ?"
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors="pt")["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)
# The pooled output is the dense representation used for retrieval
embeddings = model(input_ids).pooler_output
print(embeddings)
```
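
The passage side works symmetrically with the context encoder referenced in the `haystack` snippet below; DPR ranks passages by the dot product of the two pooled embeddings. A minimal sketch, reusing `embeddings` from the snippet above and assuming the `etalab-ia/dpr-ctx_encoder-fr_qa-camembert` model id:

```python
ctx_tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
ctx_model = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

passage = "Les chiens sont souvent considérés comme très mignons."  # hypothetical passage
ctx_ids = ctx_tokenizer(passage, return_tensors="pt")["input_ids"]
ctx_embedding = ctx_model(ctx_ids).pooler_output

# DPR relevance score: dot product between question and passage embeddings
score = embeddings @ ctx_embedding.T
print(score)
```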

And with `haystack` (using `transformers-3.3.1`), we use it as a retriever (**note that we reference it from a local path**):

```python
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="./etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="./etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    use_gpu=True,
    embed_title=False,
    batch_size=16,
    use_fast_tokenizers=False,
)
```
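
From there, retrieval is the usual `haystack` call; a sketch assuming a populated `document_store` and a `haystack` version contemporary with `transformers-3.3.1`:

```python
# Top-20 candidates, matching the evaluation setting reported above
candidates = retriever.retrieve(query="Salut, mon chien est-il mignon ?", top_k=20)
for doc in candidates:
    print(doc.text[:100])
```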

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).

## Citations

### Datasets

#### PIAF

```
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}
```

#### FQuAD

```
@article{dHoffschmidt2020FQuADFQ,
  title   = {FQuAD: French Question Answering Dataset},
  author  = {Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2002.06071}
}
```

#### SQuAD-FR

```
@misc{kabbadj2018,
  author = "Kabbadj, Ali",
  title  = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+)",
  editor = "linkedin.com",
  month  = "November",
  year   = "2018",
  url    = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
  note   = "[Online; posted 11-November-2018]"
}
```

### Models

#### CamemBERT

HF model card: [https://huggingface.co/camembert-base](https://huggingface.co/camembert-base)

```
@inproceedings{martin2020camembert,
  title     = {CamemBERT: a Tasty French Language Model},
  author    = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
}
```

#### DPR

```
@misc{karpukhin2020dense,
  title         = {Dense Passage Retrieval for Open-Domain Question Answering},
  author        = {Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
  year          = {2020},
  eprint        = {2004.04906},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```