language: fr
datasets:
- piaf
- FQuAD
- SQuAD-FR
dpr-question_encoder-fr_qa-camembert
Description
French DPR model using CamemBERT as base and then fine-tuned on a combo of three French Q&A
Data
French Q&A
We use a combination of three French Q&A datasets:
Training
We are using 90 562 random questions for train
and 22 391 for dev
. No question in train
exists in dev
. For each question, we have a single positive_context
(the paragraph where the answer to this question is found) and around 30 hard_negtive_contexts
. Hard negative contexts are found by querying an ES instance (via bm25 retrieval) and getting the top-k candidates that do not contain the answer.
The files are over here.
Evaluation
We use FQuADv1.0 and French-SQuAD evaluation sets.
Training Script
We use the official Facebook DPR implentation with a slight modification: by default, the code can work with Roberta models, still we changed a single line to make it easier to work with Camembert. This modification can be found over here.
Hyperparameters
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
--max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
--seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
--train_file DPR_FR_train.json \
--dev_file ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
--output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
--dev_batch_size 16 --val_av_rank_start_epoch 25 \
--pretrained_model_cfg ./data/bert-base-multilingual-uncased
Evaluation results
We obtain the following evaluation by using FQuAD and SQuAD-FR evaluation (or validation) sets. To obtain these results, we use haystack's evaluation script (we report Retrieval results only).
DPR
FQuAD v1.0 Evaluation
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57
SQuAD-FR Evaluation
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
BM25
For reference, BM25 gets the results shown below. As in the original paper, regarding SQuAD-like datasets, the results of DPR are consistently superseeded by BM25.
FQuAD v1.0 Evaluation
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74
SQuAD-FR Evaluation
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
Usage
The results reported here are obtained with the haystack
library. To get to similar embeddings using exclusively HF transformers
library, you can do the following:
from transformers import AutoTokenizer, AutoModel
query = "Salut, mon chien est-il mignon ?"
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors='pt')["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)
embeddings = model.forward(input_ids).pooler_output
print(embeddings)
And with haystack
(using transformers-3.3.1
), we use it as a retriever (note that we reference it from a local path):
retriever = DensePassageRetriever(document_store=document_store,
query_embedding_model="./etalab-ia/dpr-question_encoder-fr_qa-camembert",
passage_embedding_model="./etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
use_gpu=True,
embed_title=False,
batch_size=16,
use_fast_tokenizers=False
)
Acknowledgments
This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).
Citations
Datasets
PIAF
@inproceedings{KeraronLBAMSSS20,
author = {Rachel Keraron and
Guillaume Lancrenon and
Mathilde Bras and
Fr{\'{e}}d{\'{e}}ric Allary and
Gilles Moyse and
Thomas Scialom and
Edmundo{-}Pavel Soriano{-}Morales and
Jacopo Staiano},
title = {Project {PIAF:} Building a Native French Question-Answering Dataset},
booktitle = {{LREC}},
pages = {5481--5490},
publisher = {European Language Resources Association},
year = {2020}
}
FQuAD
@article{dHoffschmidt2020FQuADFQ,
title={FQuAD: French Question Answering Dataset},
author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl'e and Quentin Heinrich},
journal={ArXiv},
year={2020},
volume={abs/2002.06071}
}
SQuAD-FR
@MISC{kabbadj2018,
author = "Kabbadj, Ali",
title = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
editor = "linkedin.com",
month = "November",
year = "2018",
url = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
note = "[Online; posted 11-November-2018]",
}
Models
CamemBERT
HF model card : https://huggingface.co/camembert-base
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
DPR
@misc{karpukhin2020dense,
title={Dense Passage Retrieval for Open-Domain Question Answering},
author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
year={2020},
eprint={2004.04906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}