File size: 9,574 Bytes
597f536
 
ecd611d
 
597f536
 
 
 
 
 
 
 
 
 
 
a9838f1
f53abb6
597f536
 
 
a9838f1
 
597f536
 
 
 
 
a9838f1
597f536
 
 
 
 
a9838f1
 
597f536
 
 
 
 
a9838f1
597f536
 
 
 
 
a9838f1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
597f536
 
 
7f33d97
 
c14b676
9f30b29
c14b676
a9838f1
c14b676
d6a4c59
9f30b29
 
a197b92
 
 
 
a9838f1
a197b92
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e9d7237
 
 
 
a9838f1
a197b92
 
a9838f1
c14b676
 
 
28ba6d5
a9838f1
 
c14b676
28ba6d5
c14b676
28ba6d5
 
c14b676
 
9e761cf
c14b676
28ba6d5
c14b676
28ba6d5
 
 
c14b676
 
 
28ba6d5
 
 
 
a197b92
28ba6d5
 
9e761cf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c14b676
9f30b29
c14b676
a197b92
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
968f303
 
fd8f5e0
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
---
datasets:
- eduagarcia/LegalPT_dedup
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.9073
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8856
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8603
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8040
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8322
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8541
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaLexPT-base

RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch from the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]

## Evaluation

The model was evaluated on ["PortuLex" benchmark](eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

| **Model**                                                                  | **LeNER** | **UlyNER-PL**   | **FGV-STF** |  **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
|                                                                            |           | Coarse/Fine     | Coarse      |           |                 |
| [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28)  | 88.34     | 86.39/83.83     | 79.34       |   82.34   | 83.78           |
| [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64     | 87.77/84.74     | 79.71       | **83.79** | 84.60           |
| [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721)                   | 89.26     | 86.35/84.63     | 79.30       |   81.16   | 83.80           |
| [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721)                 | 90.09     | 88.36/**86.62** | 79.94       |   82.79   | 85.08           |
| [BERTikal-base](https://arxiv.org/abs/2110.15709)                          | 83.68     | 79.21/75.70     | 77.73       |   81.11   | 79.99           |
| [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119)        | 81.74     | 81.67/77.97     | 76.04       |   80.85   | 79.61           |
| [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119)     | 84.90     | 87.11/84.42     | 79.78       |   82.35   | 83.20           |
| [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069)                       | 87.48     | 83.49/83.16     | 79.79       |   82.35   | 83.24           |
| [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069)                      | 88.39     | 84.65/84.55     | 79.36       |   81.66   | 83.50           |
| [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069)                 | 87.96     | 88.32/84.83     | 79.57       |   81.98   | 84.02           |
| **Ours**                                                                   |           |                 |             |           |                 |
| RoBERTaTimbau-base (Reproduction of BERTimbau)                             | 89.68     | 87.53/85.74     | 78.82       |   82.03   | 84.29           |
| RoBERTaLegalPT-base (Trained on LegalPT)                                   | 90.59     | 85.45/84.40     | 79.92       |   82.84   | 84.57           |
| RoBERTaCrawlPT-base  (Trained on CrawlPT)                                  | 89.24     | 88.22/86.58     | 79.88       |   82.80   | 84.83           |
| **RoBERTaLexPT-base** (Trained on CrawlPT + LegalPT)                       | **90.73** | **88.56**/86.03 | **80.40**   |   83.22   | **85.41**       |

In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size. 
With sufficient pre-training data, it can surpass larger models. The results highlight the importance of domain-diverse training data over sheer model scale.

## Training Details

RoBERTaLexPT-base is pretrained from both data:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus by aggregating diverse sources of up to 125GiB data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Our pretraining process was executed using the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.


This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.

#### Preprocessing

Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.

To ensure that domain models are not constrained by a generic vocabulary, we utilized the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) -- BPE algorithm to train a vocabulary for each pre-training corpus used.

#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.

For other hyperparameters we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):


| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       |               12 |
| Hidden size            |              768 |
| FFN inner hidden size  |             3072 |
| Attention heads        |               12 |
| Attention head size    |               64 |
| Dropout                |              0.1 |
| Attention dropout      |              0.1 |
| Warmup steps           |               6k |
| Peak learning rate     |             4e-4 |
| Batch size             |             2048 |
| Weight decay           |             0.01 |
| Maximum training steps |            62.5k |
| Learning rate decay    |           Linear |
| AdamW $$\epsilon$$     |             1e-6 |
| AdamW $$\beta_1$$      |              0.9 |
| AdamW $$\beta_2$$      |             0.98 |
| Gradient clipping      |              0.0 |

## Citation

```
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
    and Silva, N{\'a}dia F. F.
    and Siqueira, Felipe
    and Gomes, Juliana R. S.
    and Albuqueruqe, Hidelberg O.
    and Souza, Ellen
    and Lima, Eliomar
    and De Carvalho, André",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).