---
language: pt
license: mit
tags:
  - bert
  - pytorch
datasets:
  - Twitter
---

**Paper:** For more details, see [BERTabaporu: Assessing a Genre-Specific Language Model for Portuguese NLP](https://aclanthology.org/2023.ranlp-1.24/).


## Introduction

BERTabaporu is a BERT model for Brazilian Portuguese, pre-trained on Twitter data. It was built from a collection of 238 million tweets, written by over 100 thousand unique Twitter users and comprising over 2.9 billion tokens in total.

## Available models

| Model                                    | Arch.      | #Layers | #Params |
| ---------------------------------------- | ---------- | ------- | ------- |
| `pablocosta/bertabaporu-base-uncased`    | BERT-Base  | 12      | 110M    |
| `pablocosta/bertabaporu-large-uncased`   | BERT-Large | 24      | 335M    |

## Usage

```python
from transformers import AutoModel  # or BertModel: encoder only, without pretraining heads
from transformers import AutoModelForPreTraining  # or BertForPreTraining: includes the pretraining heads
from transformers import AutoTokenizer  # or BertTokenizer

model_name = 'pablocosta/bertabaporu-base-uncased'  # or 'pablocosta/bertabaporu-large-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForPreTraining.from_pretrained(model_name)
# encoder = AutoModel.from_pretrained(model_name)  # use instead when the pretraining heads are not needed
```
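
A common next step is to turn tweets into fixed-size vectors for a downstream classifier. The sketch below is one possible way to do this and is not taken from the paper: it loads the encoder with `AutoModel` and averages the token vectors of each tweet, skipping padding. The helper names `mean_pool` and `embed_tweets` are hypothetical, and mean pooling is an assumption; using the `[CLS]` vector or fine-tuning the whole model are equally valid choices.

```python
import torch

MODEL_NAME = "pablocosta/bertabaporu-base-uncased"


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


def embed_tweets(texts, model_name=MODEL_NAME):
    """Return one vector per input text (hypothetical helper, assumes mean pooling)."""
    # Imported here so that mean_pool can be reused with torch alone.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder without pretraining heads
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])
```

Calling `embed_tweets(["bom dia", "que jogo foi esse"])` downloads the model on first use and returns a tensor with one row per tweet. Padding is masked out so that short tweets in a batch are not diluted by pad tokens.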


## Cite us


```bibtex
@inproceedings{costa-etal-2023-bertabaporu,
    title = "{BERT}abaporu: Assessing a Genre-Specific Language Model for {P}ortuguese {NLP}",
    author = "Costa, Pablo Botton  and
      Pavan, Matheus Camasmie  and
      Santos, Wesley Ramos  and
      Silva, Samuel Caetano  and
      Paraboni, Ivandr{\'e}",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.24",
    pages = "217--223",
    abstract = "Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) are now mainstream in the NLP field, but extensions to languages other than English, to new domains and/or to more specific text genres are still in demand. In this paper we introduced BERTabaporu, a BERT language model that has been pre-trained on Twitter data in the Brazilian Portuguese language. The model is shown to outperform the best-known general-purpose model for this language in three Twitter-related NLP tasks, making a potentially useful resource for Portuguese NLP in general.",
}
```