eduagarcia committed on
Commit
3171fcb
1 Parent(s): 3875afe

Create README.md

README.md ADDED
---
datasets:
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
model-index:
- name: RoBERTaCrawlPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.8924
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8822
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8658
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.7988
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8280
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8483
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch on the [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base).
This model is part of the [RoBERTaLexPT](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) work: [Coming soon]

- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]

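### Quick usage

For quick experimentation, the checkpoint can be loaded with the 🤗 Transformers fill-mask pipeline. This is a minimal sketch: it assumes the checkpoint is published under the `eduagarcia/RoBERTaCrawlPT-base` repository ID and uses RoBERTa's `<mask>` token; the example sentence is illustrative.

```python
from transformers import pipeline

# Assumes the checkpoint is available as "eduagarcia/RoBERTaCrawlPT-base" on the Hugging Face Hub.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaCrawlPT-base")

# RoBERTa-style models use "<mask>" as the mask token.
predictions = fill_mask("Brasília é a <mask> do Brasil.")
for pred in predictions:
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```
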
## Generic Evaluation

TO-DO...

## Legal Evaluation

The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (%) for multiple models evaluated on the PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| **RoBERTaCrawlPT-base (this model)** (trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
| [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) (trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

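The scores above (and the metrics in the model card header) are entity-level F1 values computed with seqeval under the IOB2 scheme. As a reference, here is a minimal sketch of such an evaluation; the tag sequences are illustrative, not taken from the PortuLex benchmark.

```python
# Minimal sketch of entity-level F1 with seqeval (strict IOB2 matching).
# The tag sequences below are illustrative only.
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

y_true = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "B-PESSOA"]]
y_pred = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "O"]]

# mode="strict" with scheme=IOB2 requires exact span boundaries and entity types to match.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
print(classification_report(y_true, y_pred, mode="strict", scheme=IOB2))
```
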
## Training Details

RoBERTaCrawlPT is pretrained on:
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup), a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, with a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration took approximately three days.

This computational cost is similar to the work of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training (62,500 steps × 2,048 sequences × up to 512 tokens ≈ 65.5 billion).

#### Preprocessing

We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and Locality-Sensitive Hashing (LSH) implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.

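The sketch below illustrates the underlying MinHash + LSH idea using the `datasketch` package. It is not the authors' `text-dedup` pipeline; it only shows how near-duplicate documents end up in the same LSH bucket, with hypothetical documents and a hypothetical similarity threshold.

```python
# Illustration of MinHash + LSH near-duplicate detection (not the authors' text-dedup pipeline).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from lowercased whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc1": "o modelo foi pré-treinado do zero com o corpus CrawlPT",
    "doc2": "o modelo foi pré-treinado do zero com o corpus CrawlPT .",  # near-duplicate of doc1
    "doc3": "avaliamos o modelo no benchmark PortuLex de NER jurídico",
}

# Documents whose estimated Jaccard similarity exceeds the threshold share an LSH bucket.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

# doc1 and doc2 are retrieved together as candidate duplicates; doc3 is not.
print(lsh.query(signatures["doc1"]))
```
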
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a vocabulary for each pretraining corpus used.

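A minimal sketch of training a byte-level BPE vocabulary with the Tokenizers library is shown below. The file path is hypothetical, and the vocabulary size and special tokens follow RoBERTa-base defaults; the card does not report the exact values used.

```python
# Sketch: training a byte-level BPE vocabulary on a pretraining corpus.
# File path, vocab_size, and special tokens are assumptions (RoBERTa-base defaults),
# not values reported in this card.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["crawlpt_corpus.txt"],  # hypothetical path to the raw-text corpus
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_crawlpt")  # writes vocab.json and merges.txt
```
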
#### Training Hyperparameters

The model was pretrained for 62,500 steps with a batch size of 2,048 sequences, each containing a maximum of 512 tokens, and a peak learning rate of 4e-4.
The weights were randomly initialized.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed with the AdamW optimizer, using a linear warmup followed by a linear decay learning rate schedule.

For the other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):

| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       | 12               |
| Hidden size            | 768              |
| FFN inner hidden size  | 3072             |
| Attention heads        | 12               |
| Attention head size    | 64               |
| Dropout                | 0.1              |
| Attention dropout      | 0.1              |
| Warmup steps           | 6k               |
| Peak learning rate     | 4e-4             |
| Batch size             | 2048             |
| Weight decay           | 0.01             |
| Maximum training steps | 62.5k            |
| Learning rate decay    | Linear           |
| AdamW $$\epsilon$$     | 1e-6             |
| AdamW $$\beta_1$$      | 0.9              |
| AdamW $$\beta_2$$      | 0.98             |
| Gradient clipping      | 0.0              |

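For readers more familiar with 🤗 Transformers than Fairseq, the hyperparameters above can be expressed roughly as follows. The actual pretraining used Fairseq v0.10.2; this sketch only mirrors the reported values, and the tokenizer path, vocabulary size, and per-device batch/accumulation split are assumptions.

```python
# The hyperparameters above, expressed as a rough Hugging Face Transformers equivalent.
# The actual pretraining used Fairseq v0.10.2; this sketch only mirrors the reported values.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

config = RobertaConfig(
    vocab_size=50_265,  # RoBERTa-base default; the actual value depends on the trained vocabulary
    num_hidden_layers=12,
    hidden_size=768,
    intermediate_size=3072,
    num_attention_heads=12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,  # 512 tokens plus RoBERTa's positional offset of 2
)
model = RobertaForMaskedLM(config)  # random weight initialization

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer_crawlpt")  # hypothetical local tokenizer
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="robertacrawlpt-base",
    max_steps=62_500,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,  # assumed split: 2 GPUs x 64 x 16 = 2048 effective batch size
    learning_rate=4e-4,
    warmup_steps=6_000,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_grad_norm=0.0,  # gradient clipping disabled, as in the table
)
```
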
## Citation

```bibtex
@InProceedings{garcia2024_roberlexpt,
    author="Garcia, Eduardo A. S.
        and Silva, N{\'a}dia F. F.
        and Siqueira, Felipe
        and Gomes, Juliana R. S.
        and Albuquerque, Hidelberg O.
        and Souza, Ellen
        and Lima, Eliomar
        and De Carvalho, Andr{\'e}",
    title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
    booktitle="Computational Processing of the Portuguese Language",
    year="2024",
    publisher="Association for Computational Linguistics"
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).