---
datasets:
- eduagarcia/LegalPT_dedup
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.9073
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8856
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8603
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8040
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8322
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8541
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaLexPT-base

RoBERTaLexPT-base is a Portuguese masked language model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora. It uses the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]
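
Since the card's `pipeline_tag` is `fill-mask`, the model can presumably be queried through the Transformers `fill-mask` pipeline. The sketch below assumes the checkpoint is published on the Hugging Face Hub as `eduagarcia/RoBERTaLexPT-base` (an id inferred from the model name and the dataset namespace, not stated in this card). The helper name `fill_mask` and the `[MASK]` placeholder are illustrative.

```python
# Assumed Hub id; adjust if the checkpoint lives under a different name.
MODEL_ID = "eduagarcia/RoBERTaLexPT-base"


def fill_mask(template, top_k=5, model_id=MODEL_ID):
    """Return (token, score) pairs for the single [MASK] slot in `template`."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = template.replace("[MASK]", unmasker.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in unmasker(text, top_k=top_k)]
```

For example, `fill_mask("O réu foi condenado a dez anos de [MASK].")` would download the checkpoint on first use and return the five highest-scoring completions.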

## Evaluation

The model was evaluated on the [PortuLex benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

| **Model**                                                                  | **LeNER** | **UlyNER-PL**   | **FGV-STF** |  **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
|                                                                            |           | Coarse/Fine     | Coarse      |           |                 |
| [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28)  | 88.34     | 86.39/83.83     | 79.34       |   82.34   | 83.78           |
| [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64     | 87.77/84.74     | 79.71       | **83.79** | 84.60           |
| [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721)                   | 89.26     | 86.35/84.63     | 79.30       |   81.16   | 83.80           |
| [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721)                 | 90.09     | 88.36/**86.62** | 79.94       |   82.79   | 85.08           |
| [BERTikal-base](https://arxiv.org/abs/2110.15709)                          | 83.68     | 79.21/75.70     | 77.73       |   81.11   | 79.99           |
| [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119)        | 81.74     | 81.67/77.97     | 76.04       |   80.85   | 79.61           |
| [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119)     | 84.90     | 87.11/84.42     | 79.78       |   82.35   | 83.20           |
| [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069)                       | 87.48     | 83.49/83.16     | 79.79       |   82.35   | 83.24           |
| [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069)                      | 88.39     | 84.65/84.55     | 79.36       |   81.66   | 83.50           |
| [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069)                 | 87.96     | 88.32/84.83     | 79.57       |   81.98   | 84.02           |
| **Ours**                                                                   |           |                 |             |           |                 |
| RoBERTaTimbau-base (Reproduction of BERTimbau)                             | 89.68     | 87.53/85.74     | 78.82       |   82.03   | 84.29           |
| RoBERTaLegalPT-base (Trained on LegalPT)                                   | 90.59     | 85.45/84.40     | 79.92       |   82.84   | 84.57           |
| RoBERTaCrawlPT-base  (Trained on CrawlPT)                                  | 89.24     | 88.22/86.58     | 79.88       |   82.80   | 84.83           |
| **RoBERTaLexPT-base** (Trained on CrawlPT + LegalPT)                       | **90.73** | **88.56**/86.03 | **80.40**   |   83.22   | **85.41**       |
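
The card does not spell out how the Average column is computed. The numbers in the table are consistent with treating UlyNER-PL as a single task (the mean of its Coarse and Fine scores) and then averaging the four tasks. A small sketch of that inferred rule, with `portulex_average` as an illustrative name:

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Average over four tasks; UlyNER-PL counts once (mean of Coarse/Fine).

    This rule is inferred from the table values, not stated in the card.
    """
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores copied from the table above.
scores = {
    "BERTimbau-base": (88.34, 86.39, 83.83, 79.34, 82.34),
    "Albertina-PT-BR-xlarge": (90.09, 88.36, 86.62, 79.94, 82.79),
    "RoBERTaLexPT-base": (90.73, 88.56, 86.03, 80.40, 83.22),
}

for name, task_scores in scores.items():
    print(f"{name}: {portulex_average(*task_scores):.2f}")
# BERTimbau-base: 83.78
# Albertina-PT-BR-xlarge: 85.08
# RoBERTaLexPT-base: 85.41
```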

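In summary, RoBERTaLexPT-base obtains the highest average F1 on PortuLex (85.41), ahead of larger models such as Albertina-PT-BR-xlarge (85.08) and BERTimbau-large (84.60), and it leads on LeNER, UlyNER-PL Coarse, and FGV-STF.
These results suggest that, with sufficient pre-training data, a base-size model can surpass larger ones, and they highlight the importance of domain-diverse training data over sheer model scale.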

## Training Details

RoBERTaLexPT-base is pretrained on the combination of two corpora:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus that aggregates diverse sources, totaling up to 125 GiB of data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Pretraining was run with the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, using two NVIDIA A100 80 GB GPUs.
Training a single configuration takes approximately three days.

This compute budget is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) and exposes the model to approximately 65 billion tokens during training.
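
The 65 billion figure follows from the step count, batch size, and sequence length listed under Training Hyperparameters. Because 512 is a maximum sequence length, the result is an upper bound that assumes every sequence is full:

```python
# Values taken from the Training Hyperparameters section of this card.
training_steps = 62_500
batch_size = 2_048        # sequences per step
max_seq_len = 512         # tokens per sequence (upper bound)

tokens_per_step = batch_size * max_seq_len
total_tokens = training_steps * tokens_per_step

print(f"{tokens_per_step:,} tokens per step")   # 1,048,576 tokens per step
print(f"{total_tokens:,} tokens in total")      # 65,536,000,000 tokens in total
```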

#### Preprocessing

Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
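
To illustrate the idea, here is a minimal pure-Python sketch of MinHash signatures with LSH banding. It is not the pipeline used to build LegalPT: the shingle size, signature length, number of bands, and similarity threshold are all illustrative choices, and it reports duplicate pairs rather than full clusters.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64   # signature length (illustrative)
BANDS = 16        # LSH bands; rows per band = NUM_HASHES // BANDS
SHINGLE = 5       # word n-gram size (illustrative)


def shingles(text, n=SHINGLE):
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash; each seed acts as one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """Keep the minimum hash of the shingle set under each hash function."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def candidate_pairs(signatures):
    """Documents sharing an identical band of their signature become candidates."""
    rows = NUM_HASHES // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.7):
    """Return pairs of document ids whose estimated similarity is >= threshold."""
    signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    return {
        (a, b)
        for a, b in candidate_pairs(signatures)
        if estimated_jaccard(signatures[a], signatures[b]) >= threshold
    }
```

Banding keeps the comparison cost manageable: only documents that collide in at least one band are compared, instead of every pair in the corpus.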

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a separate vocabulary for each pre-training corpus.
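
A minimal sketch of training such a vocabulary with the Tokenizers library is shown below. It assumes a RoBERTa-style byte-level BPE; the vocabulary size, minimum frequency, special tokens, and the helper names `batch_iterator` and `train_bpe` are placeholders, since the card does not state the actual settings.

```python
# RoBERTa-style special tokens (assumed, not confirmed by the card).
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def batch_iterator(lines, batch_size=1_000):
    """Yield lists of non-empty, stripped lines so the corpus is streamed."""
    batch = []
    for line in lines:
        line = line.strip()
        if line:
            batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_bpe(lines, vocab_size=50_000, min_frequency=2):
    """Train a byte-level BPE vocabulary on an iterable of text lines."""
    # Imported lazily so batch_iterator can be used without tokenizers installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        batch_iterator(lines),
        vocab_size=vocab_size,          # placeholder value
        min_frequency=min_frequency,    # placeholder value
        special_tokens=SPECIAL_TOKENS,
    )
    return tokenizer
```

Calling `train_bpe` once per corpus (for example, once on LegalPT and once on CrawlPT) yields one vocabulary per corpus, which matches the per-corpus setup described above.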

#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
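
The masking step can be sketched as follows. The 15% rate comes from the text above; the 80/10/10 split between mask token, random token, and unchanged token is the standard BERT/RoBERTa recipe and is assumed here rather than confirmed by this card. `mask_tokens` is an illustrative helper, not Fairseq's implementation.

```python
import random


def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None where no prediction is required."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Select a fixed share of positions for prediction.
    n_select = max(1, round(len(tokens) * mask_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))

    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                     # the model must recover this token
        r = rng.random()
        if r < 0.8:
            inputs.append(mask_token)          # 80%: replace with the mask token
        elif r < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            inputs.append(tok)                 # 10%: keep the original token
    return inputs, labels
```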

For other hyperparameters we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):


| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       |               12 |
| Hidden size            |              768 |
| FFN inner hidden size  |             3072 |
| Attention heads        |               12 |
| Attention head size    |               64 |
| Dropout                |              0.1 |
| Attention dropout      |              0.1 |
| Warmup steps           |               6k |
| Peak learning rate     |             4e-4 |
| Batch size             |             2048 |
| Weight decay           |             0.01 |
| Maximum training steps |            62.5k |
| Learning rate decay    |           Linear |
| AdamW $$\epsilon$$     |             1e-6 |
| AdamW $$\beta_1$$      |              0.9 |
| AdamW $$\beta_2$$      |             0.98 |
| Gradient clipping      |              0.0 |
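
The warmup and decay values in the table imply a learning rate schedule along these lines. The sketch assumes the warmup starts at zero and the decay ends at zero at the final step; neither endpoint is stated in this card.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=6_000, total_steps=62_500):
    """Linear warmup to peak_lr, then linear decay to zero (assumed endpoints)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, f"{learning_rate(step):.1e}")
```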

## Citation

```
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
    and Silva, N{\'a}dia F. F.
    and Siqueira, Felipe
    and Gomes, Juliana R. S.
    and Albuquerque, Hidelberg O.
    and Souza, Ellen
    and Lima, Eliomar
    and De Carvalho, André",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).