---
datasets:
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
model-index:
- name: RoBERTaCrawlPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.8924
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8822
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8658
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.7988
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8280
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8483
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch on the [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base).
This model is part of the [RoBERTaLexPT](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) work (paper: [Coming soon]).

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]

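Since the card tags the model as `fill-mask`, it can be queried directly with the Transformers pipeline API. This is a minimal usage sketch, assuming the checkpoint is hosted as `eduagarcia/RoBERTaCrawlPT-base` (repository id not stated in the card); the example sentence is illustrative only:

```python
from transformers import pipeline

# Assumed repository id for this checkpoint (not confirmed in the card itself).
unmasker = pipeline("fill-mask", model="eduagarcia/RoBERTaCrawlPT-base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for pred in unmasker("A capital do Brasil é <mask>."):
    print(pred["token_str"], round(pred["score"], 4))
```
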
## Generic Evaluation

TO-DO...

## Legal Evaluation

The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (%) for multiple models evaluated on the PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|-----------|-----------|---------------|-------------|:--------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| **RoBERTaCrawlPT-base (this)** (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| [RoBERTaLegalPT-base](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

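All scores above are entity-level F1 computed with seqeval under the strict IOB2 scheme (the table reports the macro average, obtainable with `average="macro"`). The toy snippet below only illustrates the scoring convention; the tag set and sentences are made up and are not drawn from the benchmark:

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Strict IOB2 matching: an entity counts as correct only if both its span and type match.
y_true = [["B-ORG", "I-ORG", "O", "B-PESSOA", "O"]]
y_pred = [["B-ORG", "I-ORG", "O", "O", "O"]]

# 1 of 2 gold entities recovered -> precision 1.0, recall 0.5, F1 ≈ 0.667 (micro by default).
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```
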
## Training Details

RoBERTaCrawlPT is pretrained on:
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup): a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

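For reference, the corpus can be inspected from the Hub with the `datasets` library. This is a sketch under stated assumptions: the available config names are looked up at run time rather than taken from the card, and a `train` split is assumed.

```python
from datasets import get_dataset_config_names, load_dataset

# List the dataset's configurations first, since CrawlPT_dedup groups several subcorpora;
# streaming avoids downloading the full corpus just to peek at a few records.
configs = get_dataset_config_names("eduagarcia/CrawlPT_dedup")
print(configs)

ds = load_dataset("eduagarcia/CrawlPT_dedup", configs[0], split="train", streaming=True)  # "train" split assumed
print(next(iter(ds)))
```
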
### Training Procedure

Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, using a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.

This computational cost is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.

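The ~65 billion figure follows directly from the hyperparameters reported below; counting every sequence at the maximum length of 512 tokens gives this upper bound:

```python
# Token budget implied by the pretraining hyperparameters (full-length sequences assumed).
steps, batch_size, seq_len = 62_500, 2_048, 512
print(f"{steps * batch_size * seq_len / 1e9:.1f}B tokens")  # 65.5B
```
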
#### Preprocessing

We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and Locality Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.

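The card names the text-dedup toolkit for this step; the sketch below illustrates the same MinHash + LSH idea with the `datasketch` library instead, with made-up documents and illustrative parameters, so it is not the authors' exact pipeline:

```python
from datasketch import MinHash, MinHashLSH

# Character 5-gram MinHash signatures plus LSH buckets to find near-duplicate clusters
# (illustrative parameters; the authors used the text-dedup toolkit, not datasketch).
def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

docs = {
    "doc_a": "o tribunal decidiu pela procedência do pedido",
    "doc_b": "o tribunal decidiu pela procedencia do pedido",   # near-duplicate of doc_a
    "doc_c": "texto completamente diferente sobre outro assunto",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
for doc_id, text in docs.items():
    lsh.insert(doc_id, minhash(text))

print(lsh.query(minhash(docs["doc_a"])))  # expected to contain both "doc_a" and "doc_b"
```
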
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pretraining corpus used.

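A rough sketch of that vocabulary-training step with the Tokenizers library; the input file, vocabulary size, and special tokens are placeholders following the usual RoBERTa setup, not the authors' exact configuration:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE vocabulary from a plain-text dump of the pretraining corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["crawlpt_corpus.txt"],  # placeholder path
    vocab_size=50_265,             # RoBERTa-base-sized vocabulary (assumed)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt
```
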
#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The weights were randomly initialized.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.

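The masking objective can be reproduced in Transformers terms with the standard MLM data collator; the pretraining itself was done in Fairseq, so this is only an illustration of the 15% masking rate, and the tokenizer checkpoint is the assumed repository id used in the usage sketch above:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaCrawlPT-base")  # assumed repo id
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# 15% of input tokens are selected for prediction; labels stay -100 everywhere else.
batch = collator([tokenizer("O tribunal julgou o recurso improcedente.")])
print(batch["input_ids"][0])
print(batch["labels"][0])
```
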
For all other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):

| **Hyperparameter**      | **RoBERTa-base** |
|-------------------------|-----------------:|
| Number of layers        | 12               |
| Hidden size             | 768              |
| FFN inner hidden size   | 3072             |
| Attention heads         | 12               |
| Attention head size     | 64               |
| Dropout                 | 0.1              |
| Attention dropout       | 0.1              |
| Warmup steps            | 6k               |
| Peak learning rate      | 4e-4             |
| Batch size              | 2048             |
| Weight decay            | 0.01             |
| Maximum training steps  | 62.5k            |
| Learning rate decay     | Linear           |
| AdamW $$\epsilon$$      | 1e-6             |
| AdamW $$\beta_1$$       | 0.9              |
| AdamW $$\beta_2$$       | 0.98             |
| Gradient clipping       | 0.0              |

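For readers who want to map this table onto a Hugging Face training setup, the sketch below shows where each value would go. The original run used Fairseq, and the per-device batch size / gradient accumulation split is an assumption chosen to reach the effective batch size of 2048 on 2 GPUs:

```python
from transformers import TrainingArguments

# Rough Transformers equivalent of the reported recipe (the original pretraining used Fairseq).
args = TrainingArguments(
    output_dir="roberta-crawlpt-base",
    max_steps=62_500,
    learning_rate=4e-4,
    lr_scheduler_type="linear",
    warmup_steps=6_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_grad_norm=0.0,               # gradient clipping disabled
    per_device_train_batch_size=64,  # 64 x 16 accumulation x 2 GPUs = 2048 effective batch
    gradient_accumulation_steps=16,
)
```
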
## Citation

```
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
    and Silva, N{\'a}dia F. F.
    and Siqueira, Felipe
    and Gomes, Juliana R. S.
    and Albuquerque, Hidelberg O.
    and Souza, Ellen
    and Lima, Eliomar
    and De Carvalho, Andr{\'e}",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).