JuriBERT: A Masked-Language Model Adaptation for French Legal Text
Introduction
JuriBERT is a set of BERT models (tiny, mini, small and base) pre-trained from scratch on French legal-domain corpora. The models are pre-trained on 6.3 GB of raw French legal text from two sources: the first dataset is crawled from Légifrance, and the second consists of anonymized court decisions and pleadings (mémoires ampliatifs) from the Court of Cassation. The latter contains more than 100k long documents from different court cases.
JuriBERT is available on Hugging Face in four versions with varying numbers of parameters.
JuriBERT Pre-trained models
Model | #params | Architecture |
---|---|---|
dascim/juribert-tiny | 6M | Tiny (L=2, H=128, A=2) |
dascim/juribert-mini | 15M | Mini (L=4, H=256, A=4) |
dascim/juribert-small | 42M | Small (L=6, H=512, A=8) |
dascim/juribert-base | 110M | Base (L=12, H=768, A=12) |
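Assuming the usual BERT notation (L = number of Transformer layers, H = hidden size, A = number of attention heads), the architecture column can be turned into a quick sanity check. The sketch below is an illustration rather than anything published with the models: `head_dim`, `encoder_params` and the `ffn_mult` parameter are hypothetical names, and the estimate assumes the standard BERT layer layout with a feed-forward width of 4 x H. It counts the encoder layers only, so it comes out lower than the #params column, which presumably also includes the embedding matrices.

```python
# Architecture values copied from the table above.
# Assumed notation: L = layers, H = hidden size, A = attention heads.
ARCHITECTURES = {
    "dascim/juribert-tiny": {"L": 2, "H": 128, "A": 2},
    "dascim/juribert-mini": {"L": 4, "H": 256, "A": 4},
    "dascim/juribert-small": {"L": 6, "H": 512, "A": 8},
    "dascim/juribert-base": {"L": 12, "H": 768, "A": 12},
}


def head_dim(arch):
    """Size of each attention head: the hidden size split evenly across heads."""
    return arch["H"] // arch["A"]


def encoder_params(arch, ffn_mult=4):
    """Rough parameter count of the Transformer layers only (no embeddings).

    Assumes a standard BERT layer; ffn_mult=4 is an assumption,
    not a value taken from the JuriBERT configuration files.
    """
    h = arch["H"]
    attention = 4 * (h * h + h)                      # Q, K, V, output projections + biases
    ffn = 2 * ffn_mult * h * h + ffn_mult * h + h    # two dense layers + biases
    layer_norms = 2 * 2 * h                          # two LayerNorms (scale + shift)
    return arch["L"] * (attention + ffn + layer_norms)


for name, arch in ARCHITECTURES.items():
    print(f"{name}: head_dim={head_dim(arch)}, encoder_params={encoder_params(arch):,}")
```

Under these assumptions every variant uses 64-dimensional attention heads, and the smaller models are dominated by their embedding tables rather than by their encoder layers.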
JuriBERT Usage
Load JuriBERT and its sub-word tokenizer:
from transformers import AutoModel, AutoTokenizer
# You can replace "dascim/juribert-base" with any other model from the table, e.g. "dascim/juribert-small".
tokenizer = AutoTokenizer.from_pretrained("dascim/juribert-base")
juribert = AutoModel.from_pretrained("dascim/juribert-base")
juribert.eval()  # disable dropout for inference (call juribert.train() instead to fine-tune)
Filling masks using pipeline
from transformers import pipeline
juribert_fill_mask = pipeline("fill-mask", model="dascim/juribert-base", tokenizer="dascim/juribert-base")
results = juribert_fill_mask("la chambre <mask> est une chambre de la cour de cassation.")
# results
# [{'score': 0.3455437421798706, 'token': 579, 'token_str': ' civile', 'sequence': 'la chambre civile est une chambre de la cour de cassation.'},
# {'score': 0.13046401739120483, 'token': 397, 'token_str': ' qui', 'sequence': 'la chambre qui est une chambre de la cour de cassation.'},
# {'score': 0.12387491017580032, 'token': 1060, 'token_str': ' sociale', 'sequence': 'la chambre sociale est une chambre de la cour de cassation.'},
# {'score': 0.05491165071725845, 'token': 266, 'token_str': ' c', 'sequence': 'la chambre c est une chambre de la cour de cassation.'},
# {'score': 0.04244831204414368, 'token': 2421, 'token_str': ' commerciale', 'sequence': 'la chambre commerciale est une chambre de la cour de cassation.'}]
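The pipeline returns a list of dictionaries, so the predictions are easy to post-process. As a minimal sketch, the helper below keeps only the candidates above a score threshold and strips the leading space from `token_str`. The function name `top_candidates` and the `min_score` parameter are hypothetical; the sample entries are copied from the output above, with scores rounded to four decimals and the `token` and `sequence` fields omitted for brevity.

```python
# Sample entries in the shape returned by the fill-mask pipeline
# (scores rounded; "token" and "sequence" fields left out).
results = [
    {"score": 0.3455, "token_str": " civile"},
    {"score": 0.1305, "token_str": " qui"},
    {"score": 0.1239, "token_str": " sociale"},
    {"score": 0.0549, "token_str": " c"},
    {"score": 0.0424, "token_str": " commerciale"},
]


def top_candidates(results, min_score=0.1):
    """Return (token, score) pairs scoring at least min_score, best first."""
    kept = [
        (r["token_str"].strip(), r["score"])
        for r in results
        if r["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = top_candidates(results)
print(candidates)
```

The threshold is arbitrary; in practice a domain filter (for example, a list of known chamber names) may be more useful than a fixed score cut-off.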
Extract contextual embedding features from JuriBERT output
encoded_sentence = tokenizer.encode("Les articles 21 et 22 de la présente annexe sont applicables au titre V de la loi du 1er juin 1924 mettant en vigueur la législation civile française dans les départements du Bas-Rhin, du Haut-Rhin et de la Moselle, et relatif à l'exécution forcée sur les immeubles, à la procédure en matière de purge des hypothèques et à la procédure d'ordre.", return_tensors='pt')
embeddings = juribert(encoded_sentence).last_hidden_state
print(embeddings)
# tensor([[[-0.5490, -1.4505, -0.6244, ..., -0.9739, 0.4767, -0.0655],
# [ 0.6415, -1.4368, 0.8708, ..., -0.4093, 0.6691, 0.7238],
# [-0.2195, -0.1235, 0.2674, ..., 0.5372, -0.4903, 0.5960],
# ...,
# [-1.4168, -1.3238, 1.1748, ..., 0.7590, 1.0338, -0.4865],
# [-0.5240, -0.7168, 0.8667, ..., -0.5848, 1.0086, -1.3153],
# [ 0.2743, -0.3438, 1.1101, ..., -0.5587, 0.0830, -0.3144]]],
# grad_fn=<NativeLayerNormBackward0>)
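The tensor above holds one vector per token. When a single vector per sentence is needed, one common option (an assumption here, not a method prescribed for JuriBERT) is to average the token vectors while ignoring padding positions. The sketch below uses a hypothetical helper `mean_pool` and small stand-in tensors so that it runs without downloading a model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1 values
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in tensors: one sentence, three positions, the last one is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector.shape)
```

With the real model, the inputs would come from `batch = tokenizer(texts, padding=True, return_tensors="pt")` and `juribert(**batch).last_hidden_state`, passing `batch["attention_mask"]` as the mask. Wrapping the forward pass in `torch.no_grad()` avoids building the autograd graph when the embeddings are only used as features.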
Authors
JuriBERT was trained and evaluated at École Polytechnique in collaboration with HEC Paris by Stella Douka, Hadi Abdine, Michalis Vazirgiannis, Rajaa El Hamdani and David Restrepo Amariles.
Citation
If you use our work, please cite:
@inproceedings{douka-etal-2021-juribert,
title = "{J}uri{BERT}: A Masked-Language Model Adaptation for {F}rench Legal Text",
author="Douka, Stella and Abdine, Hadi and Vazirgiannis, Michalis and El Hamdani, Rajaa and Restrepo Amariles, David",
booktitle="Proceedings of the Natural Legal Language Processing Workshop 2021",
month=nov,
year="2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.nllp-1.9",
pages = "95--101",
}