|
--- |
|
language: |
|
- pt |
|
tags: |
|
- albertina-pt* |
|
- albertina-ptpt |
|
- albertina-ptbr |
|
- albertina-ptbr-nobrwac |
|
- fill-mask |
|
- bert |
|
- deberta |
|
- portuguese |
|
- encoder |
|
- foundation model |
|
license: other |
|
datasets: |
|
- PORTULAN/glue-ptpt |
|
- assin2 |
|
- dlb/plue |
|
widget: |
|
- text: >- |
|
A culinária brasileira é rica em sabores e [MASK], tornando-se um dos |
|
maiores patrimônios do país. |
|
--- |
|
--- |
|
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png"> |
|
<p style="text-align: center;"> This is the model card for Albertina 900m PT-PT No-brWaC |
|
You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>. |
|
</p> |
|
|
|
--- |
|
|
|
# Albertina PT-BR No-brWaC |
|
|
|
|
|
**Albertina PT-*** is a foundation, large language model for the **Portuguese language**. |
|
|
|
It is an **encoder** of the BERT family, based on the Transformer neural architecture and
developed over the DeBERTa model, achieving highly competitive performance for this language.
|
It has different versions that were trained for different variants of Portuguese (PT), |
|
namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**), |
|
and it is distributed free of charge and under a most permissive license.
|
|
|
| Albertina's Family of Models | |
|
|----------------------------------------------------------------------------------------------------------| |
|
| [**Albertina 1.5B PTPT**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptpt-encoder) | |
|
| [**Albertina 1.5B PTBR**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptbr-encoder) | |
|
| [**Albertina 1.5B PTPT 256**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptpt-encoder-256)| |
|
| [**Albertina 1.5B PTBR 256**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptbr-encoder-256)| |
|
| [**Albertina 900M PTPT**](https://huggingface.co/PORTULAN/albertina-900m-portuguese-ptpt-encoder) | |
|
| [**Albertina 900M PTBR**](https://huggingface.co/PORTULAN/albertina-900m-portuguese-ptbr-encoder) | |
|
| [**Albertina 100M PTPT**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptpt-encoder) | |
|
| [**Albertina 100M PTBR**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptbr-encoder) | |
|
|
|
**Albertina PT-BR No-brWaC** is a version for American **Portuguese** from **Brazil** trained on
data sets other than brWaC, and thus distributed under a more permissive license.
|
|
|
You may also be interested in [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), trained on brWaC.
|
To the best of our knowledge, these are encoders specifically developed for this language and variant
that set a new state of the art for it, and they are made publicly available
and distributed for reuse.
|
|
|
|
|
**Albertina PT-BR No-brWaC** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal. |
|
For further details, check the respective [publication](https://arxiv.org/abs/2305.06721): |
|
|
|
``` latex |
|
@misc{albertina-pt, |
|
title={Advancing Neural Encoding of Portuguese |
|
with Transformer Albertina PT-*}, |
|
author={João Rodrigues and Luís Gomes and João Silva and |
|
António Branco and Rodrigo Santos and |
|
Henrique Lopes Cardoso and Tomás Osório}, |
|
year={2023}, |
|
eprint={2305.06721}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
Please use the above canonical reference when using or citing this model.
|
|
|
<br> |
|
|
|
|
|
# Model Description |
|
|
|
**This model card is for Albertina-PT-BR No-brWaC**, with 900M parameters, 24 layers and a hidden size of 1536. |
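
As a quick sanity check of these figures, the released configuration can be inspected with the `transformers` library; a minimal sketch, assuming the hub identifier used in the examples further below:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to confirm the architecture described above.
config = AutoConfig.from_pretrained("PORTULAN/albertina-ptbr-nobrwac")
print(config.num_hidden_layers, config.hidden_size)  # expected: 24, 1536
```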
|
|
|
Albertina-PT-BR No-brWaC is distributed under an [MIT license](https://huggingface.co/PORTULAN/albertina-ptpt/blob/main/LICENSE). |
|
|
|
DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBERTa/blob/master/LICENSE). |
|
|
|
|
|
<br> |
|
|
|
# Training Data |
|
|
|
|
|
**Albertina PT-BR No-brWaC** was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
|
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. |
|
It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters. |
|
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicates the Internet country-code top-level domain of Brazil (.br).
|
We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl. |
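
As an illustration only, and not the exact pipeline used, the variant filtering described above could be approximated with the `datasets` library along the lines below; the metadata field names follow the OSCAR-2301 layout and should be taken as assumptions:

```python
from urllib.parse import urlparse

from datasets import load_dataset


def is_brazilian_domain(document):
    # Keep only documents whose source URL uses Brazil's .br country-code top-level domain.
    meta = document.get("meta") or {}
    headers = meta.get("warc_headers") or {}
    url = headers.get("warc-target-uri", "")
    hostname = urlparse(url).hostname or ""
    return hostname.endswith(".br")


# Streaming avoids downloading the full Portuguese portion of OSCAR up front.
oscar_pt = load_dataset("oscar-corpus/OSCAR-2301", "pt", split="train", streaming=True)
oscar_ptbr = oscar_pt.filter(is_brazilian_domain)
```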
|
|
|
## Preprocessing |
|
|
|
We filtered the PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline. |
|
We skipped the default filtering of stopwords, since it would disrupt the syntactic structure, and also the filtering for language identification, given that the corpus had been pre-selected as Portuguese.
|
|
|
|
|
## Training |
|
|
|
As our codebase, we resorted to [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), originally developed for English.
|
|
|
To train [**Albertina PT-BR No-brWaC**](https://huggingface.co/PORTULAN/albertina-ptbr-nobrwac), the data set was tokenized with the original DeBERTa tokenizer with a 128-token sequence truncation and dynamic padding.
|
The model was trained using the maximum available memory capacity resulting in a batch size of 896 samples (56 samples per GPU). |
|
We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps. |
|
In total, around 200k training steps were taken across 50 epochs. |
|
The model was trained for 1 day and 13 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.
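
The pre-training itself used the DeBERTa codebase, but the data preparation just described (128-token truncation with dynamic padding for masked language modelling) can be sketched with `transformers` as follows; the 15% masking probability is the library default rather than a figure reported above:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# The original DeBERTa tokenizer, as mentioned above.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")


def tokenize(examples):
    # Truncate to 128-token sequences; padding is deferred to the collator (dynamic padding).
    return tokenizer(examples["text"], truncation=True, max_length=128)


# The collator pads each batch to its longest sequence and masks tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```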
|
|
|
<br> |
|
|
|
# Evaluation |
|
|
|
The two PT-BR model versions, **Albertina PT-BR** and **Albertina PT-BR No-brWaC**, were evaluated on downstream tasks organized into two groups.
|
|
|
In one group, we have the two data sets from the [ASSIN 2 benchmark](https://huggingface.co/datasets/assin2), namely STS and RTE, that were used to evaluate the previous state-of-the-art model [BERTimbau Large](https://huggingface.co/neuralmind/bert-large-portuguese-cased). |
|
In the other group of data sets, we have the translations into PT-BR of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue), which allowed us to test both Albertina-PT-* variants on a wider variety of downstream tasks. |
|
|
|
|
|
## ASSIN 2 |
|
|
|
[ASSIN 2](https://huggingface.co/datasets/assin2) is a **PT-BR** data set of approximately 10,000 sentence pairs, split into 6,500 for training, 500 for validation, and 2,448 for testing, annotated with semantic relatedness scores (range 1 to 5) and with binary entailment judgments.
|
This data set supports the task of semantic textual similarity (STS), which consists of assigning a score of how semantically related two sentences are; and the task of recognizing textual entailment (RTE), which, given a pair of sentences, consists of determining whether the first entails the second.
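
For reference, the data set can be loaded with the `datasets` library as in the sketch below; the field names are those we assume for the Hugging Face `assin2` loader:

```python
from datasets import load_dataset

assin2 = load_dataset("assin2")
example = assin2["train"][0]
print(example["premise"], "|", example["hypothesis"])
print(example["relatedness_score"])    # STS target: a similarity score from 1 to 5
print(example["entailment_judgment"])  # RTE target: a binary entailment label
```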
|
|
|
| Model | RTE (Accuracy) | STS (Pearson)| |
|
|------------------------------|----------------|--------------| |
|
| **Albertina-PT-BR** | **0.9130** | **0.8676** | |
|
| **Albertina-PT-BR No-brWaC** | 0.8950 | 0.8547 | |
|
|
|
|
|
## GLUE tasks translated |
|
|
|
We resort to [PLUE](https://huggingface.co/datasets/dlb/plue) (Portuguese Language Understanding Evaluation), a data set that was obtained by automatically translating GLUE into **PT-BR**. |
|
We address four tasks from those in PLUE, namely: |
|
- two similarity tasks: MRPC, for detecting whether two sentences are paraphrases of each other, and STS-B, for semantic textual similarity; |
|
- and two inference tasks: RTE, for recognizing textual entailment, and WNLI, for coreference and natural language inference.
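
The four PLUE tasks above can be loaded with the `datasets` library; a hedged sketch follows, where the configuration names are assumed to mirror the GLUE ones:

```python
from datasets import load_dataset

# Load the four PLUE configurations evaluated below.
plue = {task: load_dataset("dlb/plue", task) for task in ("mrpc", "stsb", "rte", "wnli")}
print({task: splits["train"].num_rows for task, splits in plue.items()})
```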
|
|
|
|
|
| Model | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) | |
|
|------------------------------|----------------|----------------|-----------|-----------------| |
|
| **Albertina-PT-BR No-brWaC** | **0.7798**     | **0.5070**     | **0.9167** | 0.8743          |
|
| **Albertina-PT-BR** | 0.7545 | 0.4601 | 0.9071 | **0.8910** | |
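
The figures reported in both tables are accuracy, F1 and Pearson correlation; as an illustration, these metrics can be computed with the `evaluate` library (the predictions and references below are placeholders):

```python
import evaluate

accuracy = evaluate.load("accuracy")   # RTE and WNLI
f1 = evaluate.load("f1")               # MRPC
pearson = evaluate.load("pearsonr")    # STS and STS-B

print(accuracy.compute(predictions=[1, 0, 1], references=[1, 1, 1]))
print(f1.compute(predictions=[1, 0, 1], references=[1, 1, 1]))
print(pearson.compute(predictions=[2.5, 4.0, 1.0], references=[3.0, 4.2, 1.5]))
```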
|
|
|
|
|
|
|
<br> |
|
|
|
# How to use |
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptbr-nobrwac') |
|
>>> unmasker("A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país.") |
|
|
|
[{'score': 0.3866911828517914, 'token': 23395, 'token_str': 'aromas', 'sequence': 'A culinária brasileira é rica em sabores e aromas, tornando-se um dos maiores patrimônios do país.'}, |
|
{'score': 0.2926434874534607, 'token': 10392, 'token_str': 'costumes', 'sequence': 'A culinária brasileira é rica em sabores e costumes, tornando-se um dos maiores patrimônios do país.'}, |
|
{'score': 0.1913347691297531, 'token': 21925, 'token_str': 'cores', 'sequence': 'A culinária brasileira é rica em sabores e cores, tornando-se um dos maiores patrimônios do país.'}, |
|
{'score': 0.06453365087509155, 'token': 117371, 'token_str': 'cultura', 'sequence': 'A culinária brasileira é rica em sabores e cultura, tornando-se um dos maiores patrimônios do país.'}, |
|
{'score': 0.019388679414987564, 'token': 22647, 'token_str': 'nuances', 'sequence': 'A culinária brasileira é rica em sabores e nuances, tornando-se um dos maiores patrimônios do país.'}] |
|
|
|
|
|
``` |
|
|
|
The model can be used by fine-tuning it for a specific task: |
|
|
|
```python |
|
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer |
|
>>> from datasets import load_dataset |
|
|
|
>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-ptbr-nobrwac", num_labels=2) |
|
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-ptbr-nobrwac") |
|
>>> dataset = load_dataset("PORTULAN/glue-ptpt", "rte") |
|
|
|
>>> def tokenize_function(examples): |
|
... return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True) |
|
|
|
>>> tokenized_datasets = dataset.map(tokenize_function, batched=True) |
|
|
|
>>> training_args = TrainingArguments(output_dir="albertina-ptbr-rte", evaluation_strategy="epoch") |
|
>>> trainer = Trainer( |
|
... model=model, |
|
... args=training_args, |
|
... train_dataset=tokenized_datasets["train"], |
|
... eval_dataset=tokenized_datasets["validation"], |
|
... ) |
|
|
|
>>> trainer.train() |
|
|
|
``` |
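
To also track evaluation metrics during fine-tuning, a `compute_metrics` function can be passed to the `Trainer`; the continuation below is a minimal sketch for the accuracy of the RTE task, using the `evaluate` library, and is not part of the original recipe:

```python
>>> import numpy as np
>>> import evaluate

>>> metric = evaluate.load("accuracy")

>>> def compute_metrics(eval_pred):
...     # Convert logits to class predictions before scoring.
...     logits, labels = eval_pred
...     predictions = np.argmax(logits, axis=-1)
...     return metric.compute(predictions=predictions, references=labels)

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_datasets["train"],
...     eval_dataset=tokenized_datasets["validation"],
...     compute_metrics=compute_metrics,
... )
>>> trainer.train()
```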
|
|
|
<br> |
|
|
|
# Citation |
|
|
|
When using or citing this model, kindly cite the following [publication](https://arxiv.org/abs/2305.06721): |
|
|
|
``` latex |
|
@misc{albertina-pt, |
|
title={Advancing Neural Encoding of Portuguese |
|
with Transformer Albertina PT-*}, |
|
author={João Rodrigues and Luís Gomes and João Silva and |
|
António Branco and Rodrigo Santos and |
|
Henrique Lopes Cardoso and Tomás Osório}, |
|
year={2023}, |
|
eprint={2305.06721}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
<br> |
|
|
|
# Acknowledgments |
|
|
|
The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, |
|
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the |
|
grant PINFRA/22117/2016; research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the |
|
grant CPCA-IAC/AV/478394/2022; innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização; and LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020. |