AdminBERT-4GB: A Small French Language Model Adapted to Administrative Documents

AdminBERT-4GB is a French language model adapted to the administrative domain on a large corpus of 10 million French administrative texts. It is a derivative of the CamemBERT model, which is based on the RoBERTa architecture. AdminBERT-4GB was trained with the Whole Word Masking (WWM) objective, using a 30% masking rate, for 2 epochs on 8 V100 GPUs. The training data is a sample of AdminSet.
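
Below is a minimal usage sketch with the Hugging Face transformers library, assuming the model is loaded from the Hub under the taln-ls2n/AdminBERT-4GB identifier; the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load AdminBERT-4GB (CamemBERT/RoBERTa architecture) as a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="taln-ls2n/AdminBERT-4GB")

# Illustrative French administrative sentence with a masked token
# (CamemBERT-style models use "<mask>" as the mask token).
predictions = fill_mask("Le conseil municipal a voté le <mask> communal.")

for pred in predictions:
    print(f"{pred['token_str']}\t{pred['score']:.4f}")
```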

Evaluation

Since no evaluation corpus composed of French administrative documents was available at the time, we decided to create our own for the NER (Named Entity Recognition) task.

Model Performance

| Model               | P (%) | R (%) | F1 (%) |
|---------------------|-------|-------|--------|
| Wikineural-NER FT   | 77.49 | 75.40 | 75.70  |
| NERmemBERT-Large FT | 77.43 | 78.38 | 77.13  |
| CamemBERT FT        | 77.62 | 79.59 | 77.26  |
| NERmemBERT-Base FT  | 77.99 | 79.59 | 78.34  |
| AdminBERT-NER 4GB   | 78.47 | 80.35 | 79.26  |
| AdminBERT-NER 16GB  | 78.79 | 82.07 | 80.11  |

To evaluate each model, we performed five runs on the test set of AdminSet-NER and averaged the results.
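
The sketch below shows one way such a fine-tuning setup could look with the Hugging Face Trainer; the label set, hyperparameters, and dataset handling are illustrative assumptions, not the exact settings used for the results above.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

# Hypothetical label inventory for illustration; the actual AdminSet-NER
# tag set may differ.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("taln-ls2n/AdminBERT-4GB")
model = AutoModelForTokenClassification.from_pretrained(
    "taln-ls2n/AdminBERT-4GB",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="adminbert-ner",
    learning_rate=5e-5,               # illustrative value, not the paper's setting
    num_train_epochs=3,               # illustrative value
    per_device_train_batch_size=16,   # illustrative value
)

# train_dataset / eval_dataset would be tokenized splits of AdminSet-NER:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset,
#                   tokenizer=tokenizer)
# trainer.train()
```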

Citation

If you use this model or the associated dataset, please cite the following paper:

@inproceedings{sebbag-etal-2025-adminset,
    title = "{A}dmin{S}et and {A}dmin{BERT}: a Dataset and a Pre-trained Language Model to Explore the Unstructured Maze of {F}rench Administrative Documents",
    author = "Sebbag, Thomas  and
      Quiniou, Solen  and
      Stucky, Nicolas  and
      Morin, Emmanuel",
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.27/",
    pages = "392--406",
    abstract = "In recent years, Pre-trained Language Models(PLMs) have been widely used to analyze various documents, playing a crucial role in Natural Language Processing (NLP). However, administrative texts have rarely been used in information extraction tasks, even though this resource is available as open data in many countries. Most of these texts contain many specific domain terms. Moreover, especially in France, they are unstructured because many administrations produce them without a standardized framework. Due to this fact, current language models do not process these documents correctly. In this paper, we propose AdminBERT, the first French pre-trained language models for the administrative domain. Since interesting information in such texts corresponds to named entities and the relations between them, we compare this PLM with general domain language models, fine-tuned on the Named Entity Recognition (NER) task applied to administrative texts, as well as to a Large Language Model (LLM) and to a language model with an architecture different from the BERT one. We show that taking advantage of a PLM for French administrative data increases the performance in the administrative and general domains, on these texts. We also release AdminBERT as well as AdminSet, the pre-training corpus of administrative texts in French and the subset AdminSet-NER, the first NER dataset consisting exclusively of administrative texts in French."
}