eduagarcia committed on
Commit
3171fcb
1 Parent(s): 3875afe

Create README.md

README.md ADDED
---
datasets:
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
model-index:
- name: RoBERTaCrawlPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.8924
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8822
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8658
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.7988
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8280
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8483
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch on the [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base).
This model is part of the [RoBERTaLexPT](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) work: [Coming soon]

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]

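### Quick usage

For quick experimentation, the checkpoint can be loaded with the 🤗 Transformers fill-mask pipeline. This is a minimal sketch: it assumes the checkpoint is published under the `eduagarcia/RoBERTaCrawlPT-base` repository ID and uses RoBERTa's `<mask>` token; the example sentence is illustrative.

```python
from transformers import pipeline

# Assumes the checkpoint is available as "eduagarcia/RoBERTaCrawlPT-base" on the Hugging Face Hub.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaCrawlPT-base")

# RoBERTa-style models use "<mask>" as the mask token.
predictions = fill_mask("Brasília é a <mask> do Brasil.")
for pred in predictions:
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```
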
## Generic Evaluation

TO-DO...

## Legal Evaluation

The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (%) for multiple models evaluated on the PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| **RoBERTaCrawlPT-base (this model)** (trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) (trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

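The scores above (and the metrics in the model card header) are entity-level F1 values computed with seqeval under the IOB2 scheme. As a reference, here is a minimal sketch of such an evaluation; the tag sequences are illustrative, not taken from the PortuLex benchmark.

```python
# Minimal sketch of entity-level F1 with seqeval (strict IOB2 matching).
# The tag sequences below are illustrative only.
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

y_true = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "B-PESSOA"]]
y_pred = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "O"]]

# mode="strict" with scheme=IOB2 requires exact span boundaries and entity types to match.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
print(classification_report(y_true, y_pred, mode="strict", scheme=IOB2))
```
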
## Training Details

RoBERTaCrawlPT is pretrained on:
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup), a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, with a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration took approximately three days.

This computational cost is similar to the work of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training (62,500 steps × 2,048 sequences × up to 512 tokens ≈ 65.5 billion).

#### Preprocessing

We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and Locality-Sensitive Hashing (LSH) implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.

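The sketch below illustrates the underlying MinHash + LSH idea using the `datasketch` package. It is not the authors' `text-dedup` pipeline; it only shows how near-duplicate documents end up in the same LSH bucket, with hypothetical documents and a hypothetical similarity threshold.

```python
# Illustration of MinHash + LSH near-duplicate detection (not the authors' text-dedup pipeline).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from lowercased whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc1": "o modelo foi pré-treinado do zero com o corpus CrawlPT",
    "doc2": "o modelo foi pré-treinado do zero com o corpus CrawlPT .",  # near-duplicate of doc1
    "doc3": "avaliamos o modelo no benchmark PortuLex de NER jurídico",
}

# Documents whose estimated Jaccard similarity exceeds the threshold share an LSH bucket.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

# doc1 and doc2 are retrieved together as candidate duplicates; doc3 is not.
print(lsh.query(signatures["doc1"]))
```
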
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a vocabulary for each pretraining corpus used.

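A minimal sketch of training a byte-level BPE vocabulary with the Tokenizers library is shown below. The file path is hypothetical, and the vocabulary size and special tokens follow RoBERTa-base defaults; the card does not report the exact values used.

```python
# Sketch: training a byte-level BPE vocabulary on a pretraining corpus.
# File path, vocab_size, and special tokens are assumptions (RoBERTa-base defaults),
# not values reported in this card.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["crawlpt_corpus.txt"],  # hypothetical path to the raw-text corpus
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_crawlpt")  # writes vocab.json and merges.txt
```
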
#### Training Hyperparameters

The model was pretrained for 62,500 steps with a batch size of 2,048 sequences, each containing a maximum of 512 tokens, and a peak learning rate of 4e-4.
The weights were randomly initialized.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed with the AdamW optimizer, using a linear warmup followed by a linear decay learning rate schedule.

For the other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):

| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       | 12               |
| Hidden size            | 768              |
| FFN inner hidden size  | 3072             |
| Attention heads        | 12               |
| Attention head size    | 64               |
| Dropout                | 0.1              |
| Attention dropout      | 0.1              |
| Warmup steps           | 6k               |
| Peak learning rate     | 4e-4             |
| Batch size             | 2048             |
| Weight decay           | 0.01             |
| Maximum training steps | 62.5k            |
| Learning rate decay    | Linear           |
| AdamW $$\epsilon$$     | 1e-6             |
| AdamW $$\beta_1$$      | 0.9              |
| AdamW $$\beta_2$$      | 0.98             |
| Gradient clipping      | 0.0              |

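For readers more familiar with 🤗 Transformers than Fairseq, the hyperparameters above can be expressed roughly as follows. The actual pretraining used Fairseq v0.10.2; this sketch only mirrors the reported values, and the tokenizer path, vocabulary size, and per-device batch/accumulation split are assumptions.

```python
# The hyperparameters above, expressed as a rough Hugging Face Transformers equivalent.
# The actual pretraining used Fairseq v0.10.2; this sketch only mirrors the reported values.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

config = RobertaConfig(
    vocab_size=50_265,  # RoBERTa-base default; the actual value depends on the trained vocabulary
    num_hidden_layers=12,
    hidden_size=768,
    intermediate_size=3072,
    num_attention_heads=12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,  # 512 tokens plus RoBERTa's positional offset of 2
)
model = RobertaForMaskedLM(config)  # random weight initialization

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer_crawlpt")  # hypothetical local tokenizer
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="robertacrawlpt-base",
    max_steps=62_500,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,  # assumed split: 2 GPUs x 64 x 16 = 2048 effective batch size
    learning_rate=4e-4,
    warmup_steps=6_000,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_grad_norm=0.0,  # gradient clipping disabled, as in the table
)
```
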
## Citation

```bibtex
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
        and Silva, N{\'a}dia F. F.
        and Siqueira, Felipe
        and Gomes, Juliana R. S.
        and Albuquerque, Hidelberg O.
        and Souza, Ellen
        and Lima, Eliomar
        and De Carvalho, Andr{\'e}",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).