eduagarcia committed
Commit e9d7237
1 Parent(s): 6d9f1d3

Update README.md

Files changed (1):
1. README.md (+6 −4)
README.md CHANGED
@@ -57,7 +57,7 @@ metrics:
 ---
 # RoBERTaLexPT-base
 
-RoBERTaLexPT-base is pretrained from LegalPT and CrawlPT corpora, using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
+RoBERTaLexPT-base is a Portuguese masked language model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) and CrawlPT corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
 
 - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
 - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
@@ -83,9 +83,11 @@ Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test spl
 | [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
 | [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
 | [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
-| RoBERTaTimbau-base | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
-| RoBERTaLegalPT-base | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
-| RoBERTaLexPT-base | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
+| **Ours** | | | | | |
+| RoBERTaTimbau-base (reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
+| RoBERTaLegalPT-base (trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
+| RoBERTaCrawlPT-base (trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
+| RoBERTaLexPT-base (this model, trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
 
 In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
 With sufficient pre-training data, it can surpass overparameterized models. The results highlight the importance of domain-diverse training data over sheer model scale.
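
Since the updated card describes a RoBERTa-style masked language model, a minimal usage sketch may help readers. This is an illustration, not part of the commit: it assumes the checkpoint is published under the Hugging Face repository id `eduagarcia/RoBERTaLexPT-base` and that the standard `transformers` fill-mask pipeline applies; the example sentence is arbitrary.

```python
# Minimal sketch (assumed repository id, not stated in this commit):
# masked-token prediction with the Hugging Face transformers pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
predictions = fill_mask(
    "O réu foi condenado ao pagamento de <mask> por danos morais."
)
for p in predictions:
    print(f"{p['token_str']!r}: {p['score']:.3f}")
```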