|
--- |
|
license: mit |
|
language: fr |
|
datasets: |
|
- ccnet |
|
tags: |
|
- deberta |
|
- deberta-v3 |
|
inference: false |
|
--- |
|
|
|
# CamemBERTa: A French language model based on DeBERTa V3 |
|
|
|
CamemBERTa is a French language model based on DeBERTa V3, which combines the DeBERTa V2 architecture with ELECTRA-style pretraining using the Replaced Token Detection (RTD) objective.
|
RTD uses a generator model, trained using the MLM objective, to replace masked tokens with plausible candidates, and a discriminator model trained to detect which tokens were replaced by the generator. |
|
Usually the generator and discriminator share the same embedding matrix, but the authors of DeBERTa V3 propose a new technique, gradient-disentangled embedding sharing (GDES), to disentangle the gradients of the shared embedding between the generator and discriminator.
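
To make the RTD setup concrete, here is a minimal illustrative sketch (not the actual pretraining code) that uses the released generator checkpoint to corrupt a masked sentence and derive the binary replaced/original labels the discriminator is trained to predict. The masking pattern, example sentence, and greedy sampling are simplifications chosen for illustration.

```python
# Illustrative sketch of the RTD objective (not the actual pretraining code).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("almanach/camemberta-base-generator")
generator = AutoModelForMaskedLM.from_pretrained("almanach/camemberta-base-generator")

text = "Le camembert est un fromage produit en Normandie."
original_ids = tokenizer(text, return_tensors="pt")["input_ids"]

# 1) Mask a subset of tokens (here every 3rd non-special token, purely for illustration).
special = tokenizer.get_special_tokens_mask(original_ids[0].tolist(), already_has_special_tokens=True)
masked_ids = original_ids.clone()
for i, is_special in enumerate(special):
    if not is_special and i % 3 == 0:
        masked_ids[0, i] = tokenizer.mask_token_id

# 2) The generator (trained with MLM) proposes plausible replacements for the masked positions.
#    We take the argmax for simplicity; actual RTD samples from the generator distribution.
with torch.no_grad():
    logits = generator(input_ids=masked_ids).logits
proposals = logits.argmax(dim=-1)
corrupted_ids = torch.where(masked_ids == tokenizer.mask_token_id, proposals, original_ids)

# 3) The discriminator's targets: 1 where the token was replaced, 0 where it is original.
rtd_labels = (corrupted_ids != original_ids).long()
print(tokenizer.convert_ids_to_tokens(corrupted_ids[0]))
print(rtd_labels.tolist())
```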
|
|
|
*This is the first publicly available implementation of DeBERTa V3, and the first publicly released DeBERTa V3 model outside of the original Microsoft release.*
|
|
|
Preprint Paper: https://inria.hal.science/hal-03963729/ |
|
|
|
Pre-training Code: https://gitlab.inria.fr/almanach/CamemBERTa |
|
|
|
## How to use CamemBERTa |
|
Our pretrained weights are available on the HuggingFace model hub; you can load them using the following code:
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM |
|
|
|
# Discriminator (the main CamemBERTa model, used for feature extraction and fine-tuning)
CamemBERTa = AutoModel.from_pretrained("almanach/camemberta-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/camemberta-base")

# Generator (keeps its MLM head, mainly useful for mask filling)
CamemBERTa_gen = AutoModelForMaskedLM.from_pretrained("almanach/camemberta-base-generator")
tokenizer_gen = AutoTokenizer.from_pretrained("almanach/camemberta-base-generator")
|
``` |
|
|
|
We also include the TF2 weights, including the weights of the RTD head for the discriminator and of the MLM head for the generator.
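
For instance, since the generator keeps its MLM head, it can be used directly with the fill-mask pipeline (a quick usage illustration; the example sentence is ours):

```python
from transformers import pipeline

# The generator checkpoint retains its MLM head, so it works out of the box for mask filling.
fill_mask = pipeline("fill-mask", model="almanach/camemberta-base-generator")

sentence = f"Paris est la {fill_mask.tokenizer.mask_token} de la France."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], prediction["score"])
```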
|
CamemBERTa is compatible with most fine-tuning scripts from the transformers library.
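
As an example, a standard sequence-classification fine-tuning loop with the Trainer API works without modification. This is a minimal sketch only: the CSV files, column names, label count, and hyperparameters below are placeholders, not the settings used in the paper.

```python
# Minimal fine-tuning sketch (sequence classification) with placeholder data and hyperparameters.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("almanach/camemberta-base")
model = AutoModelForSequenceClassification.from_pretrained("almanach/camemberta-base", num_labels=2)

# Placeholder dataset: any dataset with "text" and "label" columns works the same way.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="camemberta-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```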
|
|
|
## Pretraining Setup |
|
|
|
The model was trained on the French subset of the CCNet corpus (the same subset used for CamemBERT and PaGNOL). Both the discriminator (CamemBERTa) and the generator (CamemBERTa Generator) are available on the HuggingFace model hub.
|
To speed up the pre-training experiments, pre-training was split into two phases:
|
in phase 1, the model is trained with a maximum sequence length of 128 tokens for 10,000 steps with 2,000 warm-up steps and a very large batch size of 67,584. |
|
In phase 2, the maximum sequence length is increased to the full model capacity of 512 tokens for 3,300 steps with 200 warm-up steps and a batch size of 27,648.

In total, the model sees roughly 133B tokens, compared to the 419B tokens seen by CamemBERT-CCNet, which was trained for 100K steps; this represents roughly 30% of CamemBERT's full training.
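
The 133B figure can be recovered from the settings above (steps × batch size × maximum sequence length); a quick sanity check:

```python
# Approximate number of tokens seen during pre-training, from the settings above.
phase1 = 10_000 * 67_584 * 128   # steps * batch size * sequence length, ~86.5B
phase2 = 3_300 * 27_648 * 512    # ~46.7B
print(f"{(phase1 + phase2) / 1e9:.1f}B tokens")  # ~133.2B
```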
|
For a fair comparison, we trained a RoBERTa model, CamemBERT (30%), using the exact same pretraining setup but with the MLM objective.
|
|
|
## Pretraining Loss Curves |
|
Check the TensorBoard logs and plots.
|
|
|
## Fine-tuning results |
|
|
|
Datasets: POS tagging and Dependency Parsing (GSD, Rhapsodie, Sequoia, FSMB), NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD) |
|
|
|
| Model | UPOS | LAS | NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) | |
|
|-------------------|-----------|-----------|-----------|-----------|-----------|-----------|------------|------------| |
|
| CamemBERT (CCNet) | **97.59** | **88.69** | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | **62.51** | |
|
| CamemBERT (30%) | 97.53 | 87.98 | **91.04** | 93.28 | 88.94 | 79.89 | 75.14 | 56.19 | |
|
| CamemBERTa | 97.57 | 88.55 | 90.33 | **94.92** | **91.67** | **82.00** | **81.15** | 62.01 | |
|
|
|
The following table compares CamemBERTa's performance on XNLI against other models under different training setups, demonstrating CamemBERTa's data efficiency.
|
|
|
|
|
| Model | XNLI (Acc.) | Training Steps | Tokens seen in pre-training | Dataset Size in Tokens | |
|
|-------------------|-------------|----------------|-----------------------------|------------------------| |
|
| mDeBERTa | 84.4 | 500k | 2T | 2.5T | |
|
| CamemBERTa | 82.0 | 33k | 0.139T | 0.319T | |
|
| XLM-R | 81.4 | 1.5M | 6T | 2.5T | |
|
| CamemBERT - CCNet | 81.95 | 100k | 0.419T | 0.319T | |
|
|
|
*Note: The CamemBERTa step count is reported after adjusting to an equivalent batch size of 8192.*
|
|
|
## License |
|
|
|
The public model weights are licensed under the MIT License.

The pre-training code is licensed under the Apache License 2.0.
|
|
|
## Citation |
|
|
|
The paper has been accepted to Findings of ACL 2023.
|
|
|
You can use the preprint citation for now:
|
|
|
``` |
|
@article{antoun2023camemberta,
|
TITLE = {{Data-Efficient French Language Modeling with CamemBERTa}}, |
|
AUTHOR = {Antoun, Wissam and Sagot, Beno{\^i}t and Seddah, Djam{\'e}}, |
|
URL = {https://inria.hal.science/hal-03963729}, |
|
NOTE = {working paper or preprint}, |
|
YEAR = {2023}, |
|
MONTH = Jan, |
|
PDF = {https://inria.hal.science/hal-03963729/file/French_DeBERTa___ACL_2023%20to%20be%20uploaded.pdf}, |
|
HAL_ID = {hal-03963729}, |
|
HAL_VERSION = {v1}, |
|
} |
|
``` |
|
|
|
## Contact |
|
|
|
Wissam Antoun: `wissam (dot) antoun (at) inria (dot) fr` |
|
|
|
Benoit Sagot: `benoit (dot) sagot (at) inria (dot) fr` |
|
|
|
Djame Seddah: `djame (dot) seddah (at) inria (dot) fr` |