---
datasets:
- eduagarcia/LegalPT
- eduagarcia/cc100-pt
- eduagarcia/OSCAR-2301-pt_dedup
- eduagarcia/brwac_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: LeNER
      config: LeNER-Br
      split: test
    metrics:
    - type: seqeval
      value: 90.73
      name: Mean F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 88.56
      name: Mean F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/portuguese_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 86.03
      name: Mean F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaLexPT-base
RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) and CrawlPT corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]
## Evaluation
The model was evaluated on the [PortuLex benchmark](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate language models in the Portuguese legal domain.
Macro F1-Score (\%) for multiple models evaluated on the PortuLex benchmark test splits:
| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| RoBERTaLexPT-base (this) (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
In summary, RoBERTaLexPT-base achieves the best average score on PortuLex (85.41) and the best individual scores on LeNER, UlyNER-PL Coarse, and FGV-STF, despite its base size.
With sufficient pre-training data, it can surpass larger models such as Albertina-PT-BR-xlarge (85.08 average). The results highlight the importance of domain-diverse training data over sheer model scale.
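The Average column in the table above is consistent with a simple mean over the four tasks, where UlyNER-PL first contributes the mean of its Coarse and Fine scores. The card does not state this formula, so the sketch below is an assumption inferred from the reported numbers; the function name `portulex_average` is hypothetical.

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Mean over four tasks; UlyNER-PL is the mean of its two granularities.

    Assumption: this formula is inferred from the table, not stated in the card.
    """
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores copied from the table rows for two of the models.
lexpt_avg = round(portulex_average(90.73, 88.56, 86.03, 80.40, 83.22), 2)
bertimbau_avg = round(portulex_average(88.34, 86.39, 83.83, 79.34, 82.34), 2)
print(lexpt_avg, bertimbau_avg)
```

Under this assumption, the computed values match the reported averages for RoBERTaLexPT-base and BERTimbau-base.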
## Training Details
RoBERTaLexPT-base is pretrained on two corpora:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus that aggregates diverse sources, totaling up to 125 GiB of data.
- CrawlPT is a deduplicated combination of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
### Training Procedure
Our pretraining process was executed using the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, using a total of two NVIDIA A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.
This computational budget is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
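The figure of roughly 65 billion tokens can be checked from the values listed under Training Hyperparameters: 62,500 steps, 2048 sequences per batch, and at most 512 tokens per sequence. The short calculation below is a sketch that assumes every sequence is filled to the 512-token maximum, so it gives an upper bound rather than an exact count.

```python
# Values taken from the Training Hyperparameters section of this card.
training_steps = 62_500
sequences_per_batch = 2_048
max_tokens_per_sequence = 512

# Assumption: every sequence is packed to the maximum length (upper bound).
tokens_per_step = sequences_per_batch * max_tokens_per_sequence
total_tokens = training_steps * tokens_per_step

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens:,} tokens in total (~{total_tokens / 1e9:.1f} billion)")
```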
#### Preprocessing
Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a separate vocabulary for each pre-training corpus.
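To illustrate the deduplication idea, here is a minimal pure-Python sketch of MinHash signatures combined with banded Locality Sensitive Hashing. It is not the pipeline used for LegalPT: the shingle size, number of hash functions, number of bands, and all function names are assumptions chosen for illustration only.

```python
import hashlib
import re


def shingles(text, n=5):
    """Return the set of word n-grams of a document (n is an assumed setting)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=128, n=5):
    """Keep the minimum hash value of the shingle set under each hash function."""
    grams = shingles(text, n)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Split signatures into bands; documents sharing any band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


# Toy documents: "a" and "b" are exact duplicates, "c" is unrelated.
docs = {
    "a": "O tribunal decidiu por unanimidade negar provimento ao recurso interposto pela parte autora",
    "b": "O tribunal decidiu por unanimidade negar provimento ao recurso interposto pela parte autora",
    "c": "A lei entra em vigor na data de sua publicação revogadas as disposições em contrário",
}
sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
pairs = lsh_candidate_pairs(sigs)
print(pairs)
```

In a real pipeline, candidate pairs would then be grouped into clusters and all but one document per cluster removed; the banding parameters trade recall against the number of false candidates.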
#### Training Hyperparameters
The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
For other hyperparameters we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
| **Hyperparameter** | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers | 12 |
| Hidden size | 768 |
| FFN inner hidden size | 3072 |
| Attention heads | 12 |
| Attention head size | 64 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Warmup steps | 6k |
| Peak learning rate | 4e-4 |
| Batch size | 2048 |
| Weight decay | 0.01 |
| Maximum training steps | 62.5k |
| Learning rate decay | Linear |
| AdamW $$\epsilon$$ | 1e-6 |
| AdamW $$\beta_1$$ | 0.9 |
| AdamW $$\beta_2$$ | 0.98 |
| Gradient clipping | 0.0 |
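The learning rate schedule described above (linear warmup over 6k steps to a peak of 4e-4, then linear decay until step 62.5k) can be sketched as a small function. The card does not state the final learning rate, so this sketch assumes the rate decays to zero at the last step; the function name `learning_rate` is hypothetical and this is not the Fairseq implementation.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=6_000, total_steps=62_500):
    """Linear warmup to peak_lr, then linear decay.

    Assumption: the rate reaches zero at total_steps (not stated in the card).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at a few points along the schedule.
for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, learning_rate(step))
```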
## Citation
```
@InProceedings{garcia2024_roberlexpt,
author="Garcia, Eduardo A. S.
and Silva, N{\'a}dia F. F.
and Siqueira, Felipe
and Gomes, Juliana R. S.
and Albuquerque, Hidelberg O.
and Souza, Ellen
and Lima, Eliomar
and De Carvalho, André",
title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
booktitle="Computational Processing of the Portuguese Language",
year="2024",
publisher="Association for Computational Linguistics"
}
```
## Acknowledgment
This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).