---
license: cc-by-nc-nd-4.0
datasets:
- taln-ls2n/Adminset
language:
- fr
library_name: transformers
tags:
- camembert
- BERT
- Administrative documents
---

# AdminBERT 4GB: A Small French Language Model Adapted to Administrative Documents

[AdminBERT-4GB](example) is a French language model adapted to a large corpus of 10 million French administrative texts. It is derived from the CamemBERT model, which is based on the RoBERTa architecture. AdminBERT-4GB was trained with the Whole Word Masking (WWM) objective at a 30% mask rate for 2 epochs on 8 V100 GPUs. The training data is a sample of [Adminset](https://huggingface.co/datasets/taln-ls2n/Adminset).
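
To make the Whole Word Masking objective more concrete, here is a minimal sketch of how whole words could be selected for masking at a 30% rate. It is not the training code used for AdminBERT-4GB. It assumes SentencePiece-style tokens where a leading `▁` marks the start of a word (as in CamemBERT), and the function name `whole_word_mask`, the `<mask>` string, and the sample sentence are illustrative choices.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def whole_word_mask(tokens, mask_rate=0.30, seed=None):
    """Mask whole words until roughly `mask_rate` of the tokens are masked.

    Assumes a token starting with "▁" opens a new word and any other
    token continues the previous word.
    """
    rng = random.Random(seed)

    # Group token indices into words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("▁") or not words:
            words.append([i])
        else:
            words[-1].append(i)

    # Number of tokens we aim to mask (at least one).
    budget = max(1, round(len(tokens) * mask_rate))

    # Pick words in random order; every subword of a picked word is masked,
    # so the last picked word may push the count slightly over the budget.
    rng.shuffle(words)
    masked = set()
    for word in words:
        if len(masked) >= budget:
            break
        masked.update(word)

    output = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return output, sorted(masked)


tokens = ["▁Le", "▁conseil", "▁municip", "al", "▁a", "▁déli", "bé", "ré", "▁hier", "."]
masked_tokens, masked_positions = whole_word_mask(tokens, mask_rate=0.30, seed=0)
print(masked_tokens)
```

The key property is that a word split into several subwords (here `▁municip` + `al`) is either fully masked or left intact. Full masked-language-model pipelines usually also replace some selected tokens with random tokens or leave them unchanged; that step is omitted from this sketch.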

## Evaluation

At the time of writing, no evaluation corpus of French administrative documents was available, so we created our own for the Named Entity Recognition (NER) task.

### Model Performance

| Model                  | P (%)   | R (%)   | F1 (%)  |
|------------------------|---------|---------|---------|
| Wikineural-NER FT      | 77.49   | 75.40   | 75.70   |
| NERmemBERT-Large FT    | 77.43   | 78.38   | 77.13   |
| CamemBERT FT           | 77.62   | 79.59   | 77.26   |
| NERmemBERT-Base FT     | 77.99   | 79.59   | 78.34   |
| AdminBERT-NER 4GB      | 78.47   | 80.35   | 79.26   |
| AdminBERT-NER 16GB     | 78.79   | 82.07   | 80.11   |

To evaluate each model, we performed five runs and averaged the results on the test set of [Adminset-NER](https://huggingface.co/datasets/taln-ls2n/Adminset-NER).
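
The averaging over five runs can be sketched as follows. This is only an illustration of the procedure, assuming each metric is averaged independently across runs. The per-run values are made-up placeholders, not results from this evaluation, and the helper names `f1_score` and `average_runs` are hypothetical.

```python
from statistics import mean, stdev


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def average_runs(runs):
    """Return {metric: (mean, sample std)} over a list of per-run dicts."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = (mean(values), stdev(values))
    return summary


# Placeholder (precision, recall) pairs for five runs; NOT real results.
placeholder_pairs = [(0.80, 0.70), (0.70, 0.80), (0.75, 0.75), (0.90, 0.60), (0.60, 0.90)]
runs = [
    {"precision": p, "recall": r, "f1": f1_score(p, r)}
    for p, r in placeholder_pairs
]

summary = average_runs(runs)
for metric, (avg, std) in summary.items():
    print(f"{metric}: {100 * avg:.2f} ± {100 * std:.2f}")
```

When F1 is averaged per run in this way, the mean F1 is in general not the harmonic mean of the mean precision and the mean recall, so the three columns of such a table do not have to satisfy the F1 formula exactly.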