---
datasets:
- eduagarcia/LegalPT_dedup
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
results:
- task:
type: token-classification
dataset:
type: lener_br
name: lener_br
split: test
metrics:
- type: seqeval
value: 0.9073
name: F1
args:
scheme: IOB2
- task:
type: token-classification
dataset:
type: eduagarcia/PortuLex_benchmark
name: UlyNER-PL Coarse
config: UlyssesNER-Br-PL-coarse
split: test
metrics:
- type: seqeval
value: 0.8856
name: F1
args:
scheme: IOB2
- task:
type: token-classification
dataset:
type: eduagarcia/PortuLex_benchmark
name: UlyNER-PL Fine
config: UlyssesNER-Br-PL-fine
split: test
metrics:
- type: seqeval
value: 0.8603
name: F1
args:
scheme: IOB2
- task:
type: token-classification
dataset:
type: eduagarcia/PortuLex_benchmark
name: FGV-STF
config: fgv-coarse
split: test
metrics:
- type: seqeval
value: 0.8040
name: F1
args:
scheme: IOB2
- task:
type: token-classification
dataset:
type: eduagarcia/PortuLex_benchmark
name: RRIP
config: rrip
split: test
metrics:
- type: seqeval
value: 0.8322
name: F1
args:
scheme: IOB2
- task:
type: token-classification
dataset:
type: eduagarcia/PortuLex_benchmark
name: PortuLex
split: test
metrics:
- type: seqeval
value: 0.8541
name: Average F1
args:
scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaLexPT-base
RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]
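As a fill-mask model, the checkpoint can be queried directly with the Hugging Face `transformers` pipeline. A minimal usage sketch, assuming the model ID matches this repository (`eduagarcia/RoBERTaLexPT-base`):
```python
from transformers import pipeline

# Model ID is assumed to match this repository's canonical name.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style models use the <mask> special token.
print(fill_mask("O juiz determinou a <mask> do réu."))
```
For downstream tasks such as the NER benchmarks below, the checkpoint can likewise be loaded with `AutoModelForTokenClassification` and fine-tuned.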
## Evaluation
The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
Macro F1-Score (%) for multiple models evaluated on the PortuLex benchmark test splits:
| **Model** | **LeNER** | **UlyNER-PL** (Coarse/Fine) | **FGV-STF** (Coarse) | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| [RoBERTaLegalPT-base](https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base) (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| **RoBERTaLexPT-base (this)** (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
In summary, RoBERTaLexPT-base consistently achieves top effectiveness on legal NLP tasks despite its base size.
With sufficient pre-training data, it can surpass larger models. These results highlight the importance of domain-diverse training data over sheer model scale.
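The scores above (and in the model-index metadata) are seqeval F1 values under the IOB2 tagging scheme. A minimal sketch of how such a score is computed, using toy labels that are purely illustrative:
```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy gold and predicted IOB2 tag sequences (illustrative labels only).
y_true = [["B-PESSOA", "I-PESSOA", "O", "B-ORGANIZACAO", "O"]]
y_pred = [["B-PESSOA", "I-PESSOA", "O", "O", "O"]]

# Strict entity-level F1 under the IOB2 scheme.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```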
## Training Details
RoBERTaLexPT-base is pretrained on:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup), a Portuguese legal corpus that aggregates diverse sources totaling up to 125 GiB of data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup), a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
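Both corpora are hosted on the Hugging Face Hub and can be streamed with the `datasets` library. A minimal sketch; the split and configuration names are assumptions, so check each dataset card:
```python
from datasets import load_dataset

# Dataset IDs taken from the links above; split/config defaults are assumptions.
legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)
crawlpt = load_dataset("eduagarcia/CrawlPT_dedup", split="train", streaming=True)

# Inspect one document from the legal corpus.
print(next(iter(legalpt)))
```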
### Training Procedure
Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.
This computational cost is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
#### Preprocessing
We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and Locality-Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
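The exact text-dedup configuration is not reproduced here. As an illustration of the underlying technique only, the sketch below uses the `datasketch` package (an assumption, not the library cited above) to cluster near-duplicate documents with MinHash signatures and LSH:
```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from whitespace tokens (toy shingling; real pipelines use n-grams)."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc_a": "o tribunal julgou procedente o recurso",
    "doc_b": "o tribunal julgou procedente o recurso especial",
}

# The Jaccard similarity threshold of 0.7 is an illustrative choice, not the paper's setting.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, text in docs.items():
    lsh.insert(doc_id, minhash_of(text))

# Query returns the cluster of near-duplicates for a given document.
print(lsh.query(minhash_of(docs["doc_a"])))
```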
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pre-training corpus.
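A minimal sketch of such a vocabulary training run with HuggingFace Tokenizers; the file paths are placeholders, and the vocabulary size of 50,265 is the RoBERTa-base default rather than a value confirmed by the paper:
```python
from tokenizers import ByteLevelBPETokenizer

# Plain-text shards of one pre-training corpus (placeholder paths).
files = ["corpus/legalpt_shard_00.txt", "corpus/legalpt_shard_01.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50_265,          # RoBERTa-base default; assumed here
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer-legalpt")
```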
#### Training Hyperparameters
The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The weight initialization is random.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
For other parameters we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):
| **Hyperparameter** | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers | 12 |
| Hidden size | 768 |
| FFN inner hidden size | 3072 |
| Attention heads | 12 |
| Attention head size | 64 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Warmup steps | 6k |
| Peak learning rate | 4e-4 |
| Batch size | 2048 |
| Weight decay | 0.01 |
| Maximum training steps | 62.5k |
| Learning rate decay | Linear |
| AdamW $$\epsilon$$ | 1e-6 |
| AdamW $$\beta_1$$ | 0.9 |
| AdamW $$\beta_2$$ | 0.98 |
| Gradient clipping | 0.0 |
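Pretraining was run with Fairseq, but the same settings can be expressed with standard PyTorch / Transformers components. A minimal sketch mirroring the table above (not the authors' Fairseq configuration):
```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

# Architecture matching the table (RoBERTa-base).
config = RobertaConfig(
    num_hidden_layers=12,
    hidden_size=768,
    intermediate_size=3072,
    num_attention_heads=12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,  # 512 tokens plus two special positions (RoBERTa convention)
)
model = RobertaForMaskedLM(config)  # random weight initialization

# Optimizer and schedule from the table: AdamW with linear warmup (6k) and linear decay to 62.5k steps.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-4, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=62_500
)
```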
## Citation
```
@InProceedings{garcia2024_roberlexpt,
author="Garcia, Eduardo A. S.
and Silva, N{\'a}dia F. F.
and Siqueira, Felipe
and Gomes, Juliana R. S.
and Albuquerque, Hidelberg O.
and Souza, Ellen
and Lima, Eliomar
and De Carvalho, André",
title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
booktitle="Computational Processing of the Portuguese Language",
year="2024",
publisher="Association for Computational Linguistics"
}
```
## Acknowledgment
This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG). |