---
datasets:
- eduagarcia/LegalPT_dedup
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.9073
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8856
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8603
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8040
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8322
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8541
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaLexPT-base

RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by Liu et al. (2019).

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** https://aclanthology.org/2024.propor-1.38/
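
Since the card declares the `fill-mask` pipeline tag, a minimal usage sketch with the `transformers` library could look like the following. The Hub id `eduagarcia/RoBERTaLexPT-base` is an assumption inferred from the sibling model links, and `PLACEHOLDER`, `build_prompt`, and `top_predictions` are hypothetical helper names introduced only for this illustration.

```python
MODEL_ID = "eduagarcia/RoBERTaLexPT-base"  # assumed Hub id, not stated in this card
PLACEHOLDER = "[MASK]"  # hypothetical placeholder used only by this sketch


def build_prompt(template: str, mask_token: str) -> str:
    """Swap the sketch's placeholder for the tokenizer's real mask token."""
    return template.replace(PLACEHOLDER, mask_token)


def top_predictions(template: str, k: int = 5):
    """Return the k most likely fillers as (token, score) pairs."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    prompt = build_prompt(template, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(prompt, top_k=k)]
```

Calling `top_predictions("O réu foi condenado a dois anos de [MASK].")` would download the checkpoint on first use, so it needs network access. Reading the mask token from the tokenizer avoids hard-coding it.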

## Evaluation

The model was evaluated on the [PortuLex benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (\%) of multiple models on the PortuLex benchmark test splits:

| **Model**                                                                  | **LeNER** | **UlyNER-PL**   | **FGV-STF** |  **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
|                                                                            |           | Coarse/Fine     | Coarse      |           |                 |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)  | 88.34     | 86.39/83.83     | 79.34       |   82.34   | 83.78           |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64     | 87.77/84.74     | 79.71       | **83.79** | 84.60           |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based)                   | 89.26     | 86.35/84.63     | 79.30       |   81.16   | 83.80           |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr)                 | 90.09     | 88.36/**86.62** | 79.94       |   82.79   | 85.08           |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert)                          | 83.68     | 79.21/75.70     | 77.73       |   81.11   | 79.99           |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased)        | 81.74     | 81.67/77.97     | 76.04       |   80.85   | 79.61           |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased)     | 84.90     | 87.11/84.42     | 79.78       |   82.35   | 83.20           |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base)                       | 87.48     | 83.49/83.16     | 79.79       |   82.35   | 83.24           |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large)                      | 88.39     | 84.65/84.55     | 79.36       |   81.66   | 83.50           |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large)                 | 87.96     | 88.32/84.83     | 79.57       |   81.98   | 84.02           |
| **Ours**                                                                   |           |                 |             |           |                 |
| RoBERTaTimbau-base (Reproduction of BERTimbau)                             | 89.68     | 87.53/85.74     | 78.82       |   82.03   | 84.29           |
| RoBERTaLegalPT-base (Trained on LegalPT)                                   | 90.59     | 85.45/84.40     | 79.92       |   82.84   | 84.57           |
| [RoBERTaCrawlPT-base](https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base)  (Trained on CrawlPT)   | 89.24     | 88.22/86.58     | 79.88       |   82.80   | 84.83           |
| **RoBERTaLexPT-base (this)** (Trained on CrawlPT + LegalPT)                       | **90.73** | **88.56**/86.03 | **80.40**   |   83.22   | **85.41**       |

In summary, RoBERTaLexPT-base achieves the best average score (85.41) and the best results on LeNER, UlyNER-PL Coarse, and FGV-STF, despite its base size.
With sufficient pre-training data, a base model can surpass larger ones. The results highlight the importance of domain-diverse training data over sheer model scale.
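
The "Average" column appears consistent with a per-task mean in which the two UlyNER-PL scores are first averaged into one task score, matching the four-task framing of the benchmark. The sketch below checks that reading against two rows of the table. The function name `portulex_average` is hypothetical, and the averaging rule is inferred from the numbers rather than stated in this card.

```python
def portulex_average(lener, uly_coarse, uly_fine, fgv_stf, rrip):
    """Mean over four tasks, with UlyNER-PL coarse/fine averaged first (inferred rule)."""
    ulyner = (uly_coarse + uly_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores copied from the table above: LeNER, UlyNER coarse, UlyNER fine, FGV-STF, RRIP.
scores = {
    "BERTimbau-base": (88.34, 86.39, 83.83, 79.34, 82.34),
    "RoBERTaLexPT-base": (90.73, 88.56, 86.03, 80.40, 83.22),
}

for name, row in scores.items():
    print(f"{name}: {portulex_average(*row):.2f}")
```

Under this rule the two rows reproduce the reported averages of 83.78 and 85.41.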

## Training Details

RoBERTaLexPT-base is pretrained on two corpora:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup), a Portuguese legal corpus that aggregates diverse sources, totaling up to 125 GiB of data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup), a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/brwac), [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Pretraining was run with the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, using two NVIDIA A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.

This computational cost is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
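
The token budget can be checked from the hyperparameters listed in this card (62,500 steps, batch size 2048, at most 512 tokens per sequence). The sketch below treats every sequence as full length, so the result is an upper bound.

```python
steps = 62_500       # maximum training steps
batch_size = 2_048   # sequences per step
max_seq_len = 512    # maximum tokens per sequence

# Upper bound: assumes every sequence is filled to the maximum length.
tokens = steps * batch_size * max_seq_len
print(f"{tokens:,} tokens (~{tokens / 1e9:.1f} billion)")
```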

#### Preprocessing

We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and the Locality Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
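
To illustrate the idea, here is a minimal pure-Python sketch of MinHash signatures with LSH banding. It is not the text-dedup implementation. The signature length, band layout, word 3-gram shingles, and all function names are assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64               # hypothetical signature length
BANDS = 16                  # hypothetical band count
ROWS = NUM_PERM // BANDS    # rows per band


def shingles(text, n=3):
    """Set of lowercase word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, shingle):
    """64-bit hash of a shingle, salted with the seed to emulate a permutation."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def duplicate_clusters(docs):
    """Group document ids whose signatures collide in at least one LSH band."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].append(doc_id)

    # Union all documents that share a bucket.
    for ids in buckets.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    clusters = defaultdict(set)
    for doc_id in docs:
        clusters[find(doc_id)].add(doc_id)
    return [c for c in clusters.values() if len(c) > 1]


docs = {
    "a": "O recurso foi conhecido e provido pelo tribunal competente",
    "b": "O recurso foi conhecido e provido pelo tribunal competente",
    "c": "A lei entra em vigor na data de sua publicação oficial",
}
print(duplicate_clusters(docs))
```

More bands with fewer rows each make the filter more permissive, so the band layout controls the similarity threshold at which documents tend to be clustered.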

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a separate vocabulary for each pre-training corpus.
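
The effect of training a vocabulary per corpus can be seen with a toy BPE merge learner. This is a simplified character-level sketch, not the HuggingFace Tokenizers implementation. The function name `learn_bpe` and the two tiny word lists are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


legal_words = "acórdão recurso acórdão recurso agravo".split()
general_words = "futebol praia futebol praia festa".split()
print(learn_bpe(legal_words, 3))
print(learn_bpe(general_words, 3))
```

The two corpora yield different merge lists, which is the motivation for not reusing a generic vocabulary on legal text.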

#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.  
The weight initialization is random.  
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.  
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.  
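
The masking step can be sketched as follows. Only the 15% selection rate comes from this card. The 80/10/10 split between mask token, random token, and unchanged token is the usual BERT/RoBERTa recipe and is an assumption here, as are the function name `mask_tokens` and the example ids.

```python
import random

MASK_PROB = 0.15  # from the card: 15% of input tokens are selected


def mask_tokens(token_ids, mask_id, vocab_size, rng,
                mask_prob=MASK_PROB, keep_prob=0.1, random_prob=0.1):
    """Return (inputs, labels); labels are None where no prediction is required."""
    if not token_ids:
        return [], []
    num_to_mask = max(1, round(len(token_ids) * mask_prob))
    positions = rng.sample(range(len(token_ids)), num_to_mask)

    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for pos in positions:
        labels[pos] = token_ids[pos]  # the model must recover the original token
        roll = rng.random()
        if roll < keep_prob:
            continue                                   # assumed: 10% left unchanged
        if roll < keep_prob + random_prob:
            inputs[pos] = rng.randrange(vocab_size)    # assumed: 10% random token
        else:
            inputs[pos] = mask_id                      # assumed: 80% mask token
    return inputs, labels


rng = random.Random(0)
ids = list(range(100, 200))  # hypothetical token ids
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=50_000, rng=rng)
print(sum(label is not None for label in labels), "positions selected")
```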

For other parameters we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):


| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       |               12 |
| Hidden size            |              768 |
| FFN inner hidden size  |             3072 |
| Attention heads        |               12 |
| Attention head size    |               64 |
| Dropout                |              0.1 |
| Attention dropout      |              0.1 |
| Warmup steps           |               6k |
| Peak learning rate     |             4e-4 |
| Batch size             |             2048 |
| Weight decay           |             0.01 |
| Maximum training steps |            62.5k |
| Learning rate decay    |           Linear |
| AdamW $$\epsilon$$     |             1e-6 |
| AdamW $$\beta_1$$      |              0.9 |
| AdamW $$\beta_2$$      |             0.98 |
| Gradient clipping      |              0.0 |
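
The warmup and decay values in the table describe a piecewise-linear schedule, sketched below. Decaying to exactly zero at the final step is an assumption, and the function name `learning_rate` is hypothetical.

```python
PEAK_LR = 4e-4         # peak learning rate
WARMUP_STEPS = 6_000   # warmup steps
TOTAL_STEPS = 62_500   # maximum training steps


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach 0 at the end)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, f"{learning_rate(step):.2e}")
```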

## Citation

```
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
    and Silva, N{\'a}dia F. F.
    and Siqueira, Felipe
    and Gomes, Juliana R. S.
    and Albuquerque, Hidelberg O.
    and Souza, Ellen
    and Lima, Eliomar
    and De Carvalho, André",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).