gonzalez-agirre committed
Commit abbe4e5 • 1 Parent(s): 484c277
Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
 
 - "masked-lm"
 
-- "RoBERTa-large-ca"
+- "RoBERTa-large-ca-v2"
 
 - "CaText"
 
@@ -29,7 +29,7 @@ widget:
 
 ---
 
-# Catalan BERTa (roberta-large-ca) large model
+# Catalan BERTa (roberta-large-ca-v2) large model
 
 ## Table of Contents
 <details>
@@ -53,13 +53,13 @@ widget:
 
 ## Model description
 
-The **roberta-large-ca** is a transformer-based masked language model for the Catalan language.
+The **roberta-large-ca-v2** is a transformer-based masked language model for the Catalan language.
 It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) large model
 and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
 
 ## Intended Uses and Limitations
 
-**roberta-large-ca** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
+**roberta-large-ca-v2** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
 However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
 
 ## How to Use
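The Intended Uses hunk above says the model is meant to be fine-tuned on downstream tasks such as Text Classification. A minimal sketch of what that could look like with the renamed checkpoint, assuming a toy two-sentence dataset, two labels, and default training settings (the sentences, labels, and output directory are illustrative assumptions, not part of the model card):

```python
# Minimal fine-tuning sketch for a downstream text-classification task.
# The sentences, labels, and hyperparameters below are illustrative assumptions.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "projecte-aina/roberta-large-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["M'agrada molt aquest llibre.", "No m'ha agradat gens la pel·lícula."]
labels = [1, 0]  # toy labels: 1 = positive, 0 = negative

class ToyDataset(Dataset):
    """Wraps tokenized sentences and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-large-ca-v2-cls",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(texts, labels),
)
trainer.train()
```

For a real experiment, the toy dataset would be replaced by one of the Catalan benchmarks listed in the evaluation table below (e.g. TeCla for text classification).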
@@ -70,8 +70,8 @@ Here is how to use this model:
 from transformers import AutoModelForMaskedLM
 from transformers import AutoTokenizer, FillMaskPipeline
 from pprint import pprint
-tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-large-ca')
-model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-large-ca')
+tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-large-ca-v2')
+model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-large-ca-v2')
 model.eval()
 pipeline = FillMaskPipeline(model, tokenizer_hf)
 text = f"Em dic <mask>."
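The hunk above ends at README line 77, before the pipeline is actually invoked. Assembled end to end with the renamed checkpoint, and with an illustrative completion that runs the pipeline and prints the predictions (the `res` variable and the final pprint call are assumptions, not lines shown in this diff), the snippet reads:

```python
# Fill-mask usage with the renamed checkpoint; mirrors the updated hunk above.
# The last two lines (running the pipeline and printing predictions) are an
# illustrative completion and do not appear in the diff itself.
from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-large-ca-v2')
model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-large-ca-v2')
model.eval()

pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "Em dic <mask>."  # "My name is <mask>."
res = pipeline(text)
pprint([(pred["token_str"], round(pred["score"], 4)) for pred in res])
```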
@@ -171,7 +171,7 @@ Here are the train/dev/test splits of the datasets:
 
 | Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
-| RoBERTa-large-ca | **89.82** | **99.02** | **83.41** | **75.46** | **83.61** | **89.34/75.50** | **89.20**/75.77 | **90.72/79.06** | **73.79**/55.34 |
+| RoBERTa-large-ca-v2 | **89.82** | **99.02** | **83.41** | **75.46** | **83.61** | **89.34/75.50** | **89.20**/75.77 | **90.72/79.06** | **73.79**/55.34 |
 | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 87.74/72.58 | 88.72/**75.91** | 89.50/76.63 | 73.64/**55.42** |
 | BERTa | 89.76 | 98.96 | 80.19 | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/77.14 | 69.20/51.47 |
 | mBERT | 86.87 | 98.83 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |