metadata
datasets:
  - eduagarcia/CrawlPT_dedup
language:
  - pt
pipeline_tag: fill-mask
model-index:
  - name: RoBERTaCrawlPT-base
    results:
      - task:
          type: token-classification
        dataset:
          type: lener_br
          name: lener_br
          split: test
        metrics:
          - type: seqeval
            value: 0.8924
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: UlyNER-PL Coarse
          config: UlyssesNER-Br-PL-coarse
          split: test
        metrics:
          - type: seqeval
            value: 0.8822
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: UlyNER-PL Fine
          config: UlyssesNER-Br-PL-fine
          split: test
        metrics:
          - type: seqeval
            value: 0.8658
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: FGV-STF
          config: fgv-coarse
          split: test
        metrics:
          - type: seqeval
            value: 0.7988
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: RRIP
          config: rrip
          split: test
        metrics:
          - type: seqeval
            value: 0.828
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: PortuLex
          split: test
        metrics:
          - type: seqeval
            value: 0.8483
            name: Average F1
            args:
              scheme: IOB2
license: cc-by-4.0
metrics:
  - seqeval

RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese masked language model pretrained from scratch on the CrawlPT corpora, using the same architecture as RoBERTa-base. This model is part of the RoBERTaLexPT work: [Coming soon]

Generic Evaluation

TO-DO...

Legal Evaluation

The model was evaluated on the PortuLex benchmark, a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (%) for multiple models evaluated on PortuLex benchmark test splits:

| Model | LeNER | UlyNER-PL (Coarse/Fine) | FGV-STF (Coarse) | RRIP | Average (%) |
|---|---|---|---|---|---|
| BERTimbau-base | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| BERTimbau-large | 88.64 | 87.77/84.74 | 79.71 | 83.79 | 84.60 |
| Albertina-PT-BR-base | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| Albertina-PT-BR-xlarge | 90.09 | 88.36/86.62 | 79.94 | 82.79 | 85.08 |
| BERTikal-base | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| JurisBERT-base | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| BERTimbauLAW-base | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| Legal-XLM-R-base | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| Legal-XLM-R-large | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| Legal-RoBERTa-PT-large | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| Ours | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| RoBERTaCrawlPT-base (this) (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| RoBERTaLexPT-base (Trained on CrawlPT + LegalPT) | 90.73 | 88.56/86.03 | 80.40 | 83.22 | 85.41 |
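
The Average column appears to be consistent with first averaging the UlyNER-PL Coarse and Fine scores into a single task score and then taking the mean over the four tasks. The sketch below checks this for the RoBERTaCrawlPT-base row. The aggregation rule is an assumption inferred from the numbers, not a documented procedure, and the variable names are illustrative.

```python
# Scores for RoBERTaCrawlPT-base, copied from the table above.
# UlyNER-PL has two configurations (Coarse, Fine), stored as a tuple.
scores = {
    "LeNER": 89.24,
    "UlyNER-PL": (88.22, 86.58),
    "FGV-STF": 79.88,
    "RRIP": 82.80,
}


def task_score(value):
    """Collapse a multi-configuration task into one score (assumed: plain mean)."""
    if isinstance(value, tuple):
        return sum(value) / len(value)
    return value


# Mean over the four tasks, each task weighted equally.
average = sum(task_score(v) for v in scores.values()) / len(scores)
print(f"{average:.2f}")  # 84.83
```

The same rule also reproduces the BERTimbau-base average of 83.78.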

Training Details

RoBERTaCrawlPT is pretrained on:

Training Procedure

Our pretraining process was executed using the Fairseq library v0.10.2 on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs. The complete training of a single configuration takes approximately three days.

This computational budget is similar to that of BERTimbau-base and exposes the model to approximately 65 billion tokens during training.
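
The figure of roughly 65 billion tokens can be checked from the step count, batch size, and maximum sequence length reported under Training Hyperparameters. The sketch below assumes the batch size is counted in sequences and that every sequence is filled to the maximum length, so it gives an upper bound.

```python
# Values reported under Training Hyperparameters.
steps = 62_500
batch_size = 2048   # assumed to be counted in sequences per step
max_seq_len = 512   # assumed to be fully used in every sequence

tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,} tokens (~{tokens_seen / 1e9:.1f}B)")
```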

Preprocessing

We deduplicated all subsets of the CrawlPT corpus using a MinHash algorithm and the Locality Sensitive Hashing implementation from the text-dedup library to find clusters of duplicate documents.
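
To illustrate the idea behind MinHash with Locality Sensitive Hashing, here is a minimal pure-Python sketch. It is not the text-dedup API and not the configuration used for CrawlPT: the function names, the word 3-gram shingles, the 64 hash functions, and the 16 bands are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Set of lowercase word n-grams (n=3 is an illustrative choice)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def duplicate_clusters(docs, num_perm=64, bands=16):
    """Group documents that share at least one LSH band; returns index clusters."""
    rows = num_perm // bands
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    buckets = {}
    for idx, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            if key in buckets:
                # Same band values as an earlier document: merge their clusters.
                parent[find(idx)] = find(buckets[key])
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for i in range(len(docs)):
        clusters[find(i)].append(i)
    return [members for members in clusters.values() if len(members) > 1]


docs = [
    "O tribunal julgou o recurso improcedente na sessão de ontem",
    "O Tribunal julgou o recurso improcedente na sessão de ontem",
    "A câmara aprovou novo projeto sobre educação básica",
]
print(duplicate_clusters(docs))
```

In a deduplication pipeline, one document per cluster would typically be kept and the rest dropped. More bands with fewer rows each make the candidate search more permissive, and fewer bands make it stricter.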

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from HuggingFace Tokenizers to train a vocabulary for each pre-training corpus.
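
As a rough illustration of training a corpus-specific BPE vocabulary with the HuggingFace Tokenizers library, consider the sketch below. The byte-level variant, the tiny in-memory corpus, the vocabulary size, and the special-token list are assumptions made for the example and are not taken from the actual training setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus; the real vocabularies were trained on each
# pre-training corpus, which is far larger.
corpus = [
    "O tribunal julgou o recurso improcedente.",
    "A câmara aprovou o projeto de lei.",
    "O recurso foi julgado pelo tribunal superior.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=300,   # illustrative only; far smaller than a real vocabulary
    min_frequency=1,
    show_progress=False,
    # RoBERTa-style special tokens (assumed for this sketch).
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("O tribunal julgou o recurso.")
print(encoding.tokens)
```

Training one vocabulary per corpus means that frequent domain terms can become single tokens instead of being split into many generic subwords.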

Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The weight initialization is random.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
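
The masking step of the masked language modeling objective can be sketched as follows. Only the 15% masking rate comes from the description above. The 80/10/10 split between mask token, random token, and unchanged token is the usual BERT/RoBERTa recipe and is assumed here, as are the function name, the token ids, and the `-100` ignore label.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset(),
                mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels hold the original token at selected positions and -100 elsewhere
    (-100 is a common "ignore" label; an assumption for this sketch).
    """
    rng = random.Random(seed)
    # Special tokens such as <s> and </s> are never selected.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    num_to_mask = max(1, round(len(candidates) * mask_prob)) if candidates else 0
    chosen = rng.sample(candidates, num_to_mask)

    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in chosen:
        labels[i] = token_ids[i]
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical ids: 0 = <s>, 2 = </s>, 4 = <mask>.
tokens = [0] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(tokens, mask_id=4, vocab_size=1000,
                             special_ids=frozenset({0, 2}), seed=0)
print(sum(label != -100 for label in labels), "of", len(tokens), "positions selected")
```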

For other parameters we adopted the standard RoBERTa-base hyperparameters:

| Hyperparameter | RoBERTa-base |
|---|---|
| Number of layers | 12 |
| Hidden size | 768 |
| FFN inner hidden size | 3072 |
| Attention heads | 12 |
| Attention head size | 64 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Warmup steps | 6k |
| Peak learning rate | 4e-4 |
| Batch size | 2048 |
| Weight decay | 0.01 |
| Maximum training steps | 62.5k |
| Learning rate decay | Linear |
| AdamW $$\epsilon$$ | 1e-6 |
| AdamW $$\beta_1$$ | 0.9 |
| AdamW $$\beta_2$$ | 0.98 |
| Gradient clipping | 0.0 |
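
The learning rate schedule implied by these values (linear warmup over 6k steps to a peak of 4e-4, then linear decay until step 62.5k) can be sketched as a plain function. The final learning rate of 0 is an assumption, since the table does not state it, and the function name is illustrative.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=6_000,
                  total_steps=62_500, end_lr=0.0):
    """Linear warmup to peak_lr, then linear decay to end_lr (assumed 0)."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    # Decay: shrink linearly from peak_lr at warmup_steps to end_lr at total_steps.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return end_lr + (peak_lr - end_lr) * remaining


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, learning_rate(step))
```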

Citation

@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
    and Silva, N{\'a}dia F. F.
    and Siqueira, Felipe
    and Gomes, Juliana R. S.
    and Albuquerque, Hidelberg O.
    and Souza, Ellen
    and Lima, Eliomar
    and De Carvalho, André",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}

Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).