datasets:
- eduagarcia/LegalPT
- eduagarcia/cc100-pt
- eduagarcia/OSCAR-2301-pt_dedup
- eduagarcia/brwac_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: LeNER
      config: LeNER-Br
      split: test
    metrics:
    - type: seqeval
      value: 90.73
      name: Mean F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 88.56
      name: Mean F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 86.03
      name: Mean F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
RoBERTaLexPT-base
RoBERTaLexPT-base is a Portuguese masked language model pretrained from scratch on the LegalPT and CrawlPT corpora, using the same architecture as RoBERTa-base, introduced by Liu et al. (2019).
- Language(s) (NLP): Brazilian Portuguese (pt-BR)
- License: Creative Commons Attribution 4.0 International Public License
- Repository: https://github.com/eduagarcia/roberta-legal-portuguese
- Paper: [Coming soon]
Evaluation
The model was evaluated on the PortuLex benchmark, a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
Macro F1-score (%) of multiple models on the PortuLex benchmark test splits:
Model | LeNER | UlyNER-PL (Coarse/Fine) | FGV-STF (Coarse) | RRIP | Average (%) |
---|---|---|---|---|---|
BERTimbau-base | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
BERTimbau-large | 88.64 | 87.77/84.74 | 79.71 | 83.79 | 84.60 |
Albertina-PT-BR-base | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
Albertina-PT-BR-xlarge | 90.09 | 88.36/86.62 | 79.94 | 82.79 | 85.08 |
BERTikal-base | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
JurisBERT-base | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
BERTimbauLAW-base | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
Legal-XLM-R-base | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
Legal-XLM-R-large | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
Legal-RoBERTa-PT-large | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
Ours | | | | | |
RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
RoBERTaLexPT-base (this) (Trained on CrawlPT + LegalPT) | 90.73 | 88.56/86.03 | 80.40 | 83.22 | 85.41 |
In summary, RoBERTaLexPT-base achieves the highest average score in the table (85.41) and the best results on LeNER, UlyNER-PL coarse, and FGV-STF, despite its base size. With sufficient pre-training data, a base-size model can surpass larger models such as Albertina-PT-BR-xlarge (85.08 average). The results highlight the importance of domain-diverse training data over sheer model scale.
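The Average column appears to be the unweighted mean over the four PortuLex tasks, with the two UlyNER-PL scores (coarse and fine) first averaged into a single task score. This is an inference from the reported numbers, not a definition stated in this card. The sketch below recomputes the average for three rows under that assumption; the function name `portulex_average` is hypothetical.

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Mean over four tasks; UlyNER-PL counts once (assumed aggregation)."""
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return round((lener + ulyner + fgv_stf + rrip) / 4, 2)


# Scores copied from the table: LeNER, UlyNER-PL coarse, UlyNER-PL fine, FGV-STF, RRIP.
rows = {
    "BERTimbau-base": (88.34, 86.39, 83.83, 79.34, 82.34),
    "Albertina-PT-BR-xlarge": (90.09, 88.36, 86.62, 79.94, 82.79),
    "RoBERTaLexPT-base": (90.73, 88.56, 86.03, 80.40, 83.22),
}

averages = {name: portulex_average(*scores) for name, scores in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Under this assumption the three recomputed values match the reported averages (83.78, 85.08, and 85.41).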
Training Details
RoBERTaLexPT-base is pretrained on two corpora:
- LegalPT is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
- CrawlPT is a deduplicated combination of three general-domain Portuguese corpora: brWaC, CC100-PT, and OSCAR-2301.
Training Procedure
Pretraining was run with the Fairseq library on a DGX-A100 cluster, using a total of 2 NVIDIA A100 80 GB GPUs. Training a single configuration to completion takes approximately three days.
This computational budget is similar to that of BERTimbau and exposes the model to approximately 65 billion tokens during training.
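The token figure can be checked against the hyperparameters reported in this card (62,500 steps, 2048 sequences per batch, at most 512 tokens per sequence). The short calculation below gives an upper bound, since it assumes every sequence is filled to the maximum length.

```python
steps = 62_500        # maximum training steps
batch_size = 2048     # sequences per batch
max_seq_len = 512     # maximum tokens per sequence

# Upper bound: assumes every sequence uses all 512 positions.
max_tokens = steps * batch_size * max_seq_len
print(f"{max_tokens:,} tokens (~{max_tokens / 1e9:.1f} billion)")
```

The product is 65,536,000,000, which agrees with the "approximately 65 billion" figure.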
Preprocessing
Following the approach of Lee et al. (2022), we deduplicated all subsets of the LegalPT corpus using the MinHash algorithm and Locality-Sensitive Hashing (LSH) to find clusters of duplicate documents.
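To make the idea concrete, here is a minimal pure-Python sketch of MinHash signatures with banded LSH and union-find clustering. It is not the pipeline used for LegalPT: the n-gram size, signature length, band count, hash function, and all function names are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (illustrative value)
BANDS = 16     # LSH bands; rows per band = NUM_PERM // BANDS
NGRAM = 5      # character n-gram (shingle) size


def shingles(text, n=NGRAM):
    """Set of character n-grams after lowercasing and whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(seed, shingle):
    """Seeded 64-bit hash standing in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """For each seed, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(docs):
    """Documents sharing at least one identical band become candidate pairs."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


def duplicate_clusters(docs):
    """Merge candidate pairs into clusters with a small union-find."""
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in candidate_pairs(docs):
        parent[find(a)] = find(b)
    clusters = defaultdict(list)
    for d in docs:
        clusters[find(d)].append(d)
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


docs = {
    "a": "Art. 1º Esta lei entra em vigor na data de sua publicação.",
    "b": "art. 1º  esta lei entra em vigor na data de sua publicação.",
    "c": "O tribunal negou provimento ao recurso especial.",
}
print(duplicate_clusters(docs))
```

Documents `a` and `b` differ only in case and spacing, so they share a signature and fall into one cluster. Near-duplicates with high Jaccard similarity collide in at least one band with high probability, which is what makes the method usable at corpus scale without comparing every pair of documents.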
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the HuggingFace Tokenizers library to train a separate vocabulary for each pre-training corpus.
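The effect of training one vocabulary per corpus can be illustrated with a toy BPE merge learner. This is a swapped-in, character-level, pure-Python sketch, not the HuggingFace Tokenizers implementation; the function name `learn_bpe` and the two tiny corpora are hypothetical.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from whitespace-split text (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges


# Hypothetical miniature corpora standing in for a legal and a general corpus.
legal = ["recurso especial", "recurso ordinario", "recurso extraordinario"]
general = ["futebol hoje", "futebol amanha", "jogo de futebol"]

legal_merges = learn_bpe(legal, 5)
general_merges = learn_bpe(general, 5)
print("legal:  ", legal_merges)
print("general:", general_merges)
```

Because merges follow pair frequencies, the two corpora produce different merge lists, and therefore different subword vocabularies. That is the motivation for not reusing a generic vocabulary for a domain model.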
Training Hyperparameters
The model was pretrained for 62,500 steps with a batch size of 2048 sequences, each containing at most 512 tokens. We used the masked language modeling objective, in which 15% of the input tokens are randomly masked. Optimization used AdamW with a linear warmup followed by a linear learning rate decay.
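As a rough illustration of the masking step, the sketch below selects each token with probability 0.15 and records it as a prediction target. The 80/10/10 split between the mask token, a random token, and the unchanged token is the usual BERT/RoBERTa convention and is an assumption here, as are the `-100` ignore label, the function name `mask_tokens`, and the example ids.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling (illustrative sketch)."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # assumed: 80% replaced by the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # assumed: 10% random token
        # assumed: remaining 10% keep the original token
    return inputs, labels


tokens = list(range(100, 10_100))  # hypothetical token ids
inputs, labels = mask_tokens(tokens, mask_id=4, vocab_size=50_000,
                             rng=random.Random(0))
selected = sum(label != -100 for label in labels)
print(f"selected {selected / len(tokens):.1%} of positions")
```

Sampling the mask at call time rather than once during preprocessing means the same sequence gets a different mask each time it is seen.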
For the remaining hyperparameters, we adopted the standard RoBERTa settings:
Hyperparameter | RoBERTa-base |
---|---|
Number of layers | 12 |
Hidden size | 768 |
FFN inner hidden size | 3072 |
Attention heads | 12 |
Attention head size | 64 |
Dropout | 0.1 |
Attention dropout | 0.1 |
Warmup steps | 6k |
Peak learning rate | 4e-4 |
Batch size | 2048 |
Weight decay | 0.01 |
Maximum training steps | 62.5k |
Learning rate decay | Linear |
AdamW $$\epsilon$$ | 1e-6 |
AdamW $$\beta_1$$ | 0.9 |
AdamW $$\beta_2$$ | 0.98 |
Gradient clipping | 0.0 |
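The warmup and decay values in the table define a piecewise-linear schedule, sketched below. The sketch assumes the rate starts at 0 and decays to exactly 0 at the final step; the card does not state the start and end values, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=6_000, total_steps=62_500):
    """Linear warmup to peak_lr, then linear decay to 0 (assumed end value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
```

With these values the rate reaches its peak of 4e-4 at step 6,000 and is back to half the peak at step 34,250, the midpoint of the decay phase.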
Citation
@InProceedings{garcia2024_roberlexpt,
author="Garcia, Eduardo A. S.
and Silva, N{\'a}dia F. F.
and Siqueira, Felipe
and Gomes, Juliana R. S.
and Albuquerque, Hidelberg O.
and Souza, Ellen
and Lima, Eliomar
and De Carvalho, André",
title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
booktitle="Computational Processing of the Portuguese Language",
year="2024",
publisher="Association for Computational Linguistics"
}
Acknowledgment
This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).