metadata

datasets:
  - eduagarcia/LegalPT
  - eduagarcia/cc100-pt
  - eduagarcia/OSCAR-2301-pt_dedup
  - eduagarcia/brwac_dedup
language:
  - pt
pipeline_tag: fill-mask
tags:
  - legal
model-index:
  - name: RoBERTaLexPT-base
    results:
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/portuguese_benchmark
          name: LeNER
          config: LeNER-Br
          split: test
        metrics:
          - type: seqeval
            value: 90.73
            name: Mean F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/portuguese_benchmark
          name: UlyNER-PL Coarse
          config: UlyssesNER-Br-PL-coarse
          split: test
        metrics:
          - type: seqeval
            value: 88.56
            name: Mean F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/portuguese_benchmark
          name: UlyNER-PL Fine
          config: UlyssesNER-Br-PL-fine
          split: test
        metrics:
          - type: seqeval
            value: 86.03
            name: Mean F1
            args:
              scheme: IOB2
license: cc-by-4.0
metrics:
  - seqeval

RoBERTaLexPT-base

RoBERTaLexPT-base is a masked language model pretrained on the LegalPT and CrawlPT corpora, using the RoBERTa-base architecture introduced by Liu et al. (2019).

Language(s) (NLP): Brazilian Portuguese (pt-BR)
License: Creative Commons Attribution 4.0 International Public License
Repository: https://github.com/eduagarcia/roberta-legal-portuguese
Paper: [More Information Needed]

Training Details

RoBERTaLexPT-base is pretrained on two corpora:

LegalPT is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
CrawlPT is a deduplicated combination of three general Portuguese corpora: brWaC, CC100-PT, and OSCAR-2301.

Training Procedure

Our pretraining process was executed using the Fairseq library on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs. The complete training of a single configuration takes approximately three days.

This computational setup is comparable to the one used for BERTimbau and exposes the model to approximately 65 billion tokens during training.

Preprocessing

Following the approach of Lee et al. (2022), we deduplicated all subsets of the LegalPT Corpus using the MinHash algorithm and Locality Sensitive Hashing to find clusters of duplicate documents.
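The deduplication idea can be illustrated with a minimal pure-Python sketch of MinHash signatures combined with LSH banding. This is not the pipeline used for LegalPT: the shingle size, the number of hash functions, the band count, and all function names below are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # assumed number of hash functions in a signature
BANDS = 16     # assumed number of LSH bands (4 signature rows per band)
NGRAM = 5      # assumed word-shingle size


def shingles(text, n=NGRAM):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash of a shingle (one simulated permutation per seed)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(seed, s) for s in doc_shingles) for seed in range(NUM_PERM)]


def candidate_pairs(docs):
    """Return pairs of document ids that share at least one LSH band bucket."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Calling `candidate_pairs` on a dictionary mapping document ids to texts returns the pairs that land in a common bucket. Clusters of duplicates could then be obtained as connected components over those pairs, keeping one document per cluster.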

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the HuggingFace Tokenizers library to train a separate vocabulary for each pretraining corpus.
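To show why a corpus-specific vocabulary matters, the following toy sketch learns BPE merges from a list of words. It is a simplified character-level illustration, not the HuggingFace Tokenizers implementation; the function name `learn_bpe` and the tiny word list are hypothetical.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; break ties lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Because merges follow pair frequencies in the training text, strings that are frequent in a legal corpus become single tokens early, whereas a generic vocabulary may split them into several pieces.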

Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens. We employed the masked language modeling objective, where 15% of the input tokens were randomly masked. The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
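The token budget implied by these settings can be checked with a few lines of arithmetic. The result is an upper bound, assuming every sequence is filled to the 512-token maximum; the variable names are illustrative.

```python
steps = 62_500        # training steps
batch_size = 2_048    # sequences per step
max_seq_len = 512     # maximum tokens per sequence

# Upper bound on tokens processed per optimizer step.
tokens_per_step = batch_size * max_seq_len

# Upper bound on tokens seen over the whole run.
total_tokens = steps * tokens_per_step

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens:,} tokens in total")
```

The total comes to about 65.5 billion tokens, which matches the approximate figure of 65 billion tokens.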

We adopted the standard RoBERTa hyperparameters:

| Hyperparameter | RoBERTa-base |
|---|---|
| Number of layers | 12 |
| Hidden size | 768 |
| FFN inner hidden size | 3072 |
| Attention heads | 12 |
| Attention head size | 64 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Warmup steps | 6k |
| Peak learning rate | 4e-4 |
| Batch size | 2048 |
| Weight decay | 0.01 |
| Maximum training steps | 62.5k |
| Learning rate decay | Linear |
| AdamW $$\epsilon$$ | 1e-6 |
| AdamW $$\beta_1$$ | 0.9 |
| AdamW $$\beta_2$$ | 0.98 |
| Gradient clipping | 0.0 |
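
The warmup and decay values listed above describe a piecewise-linear learning rate schedule. The sketch below is one plausible reading of it, assuming the rate starts at zero, reaches the peak at the end of warmup, and decays linearly to zero at the final step; the start and end values are assumptions, and `learning_rate` is a hypothetical helper rather than a Fairseq function.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=6_000, max_steps=62_500):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from the peak down to 0.
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, learning_rate(step))
```

Under these assumptions the rate is half the peak midway through warmup (step 3,000) and again midway through the decay phase (step 34,250).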

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on the "PortuLex" benchmark, a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
