language: ro
tags:
  - bert
  - fill-mask
license: mit

bert-base-romanian-uncased-v1

The BERT base, uncased model for Romanian, trained on a 15GB corpus. This is version v1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# get encoding
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
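
The snippet above returns one vector per token. If a single vector per sentence is needed, one common option (an assumption here, not something this model card prescribes) is to average the token vectors while ignoring padding. The helper name `mean_pool` and its `attention_mask` argument are hypothetical and used only for illustration:

```python
import torch


def mean_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, skipping padded positions.

    last_hidden_states: (batch, seq_len, hidden) tensor from the model.
    attention_mask:     (batch, seq_len) tensor, 1 for real tokens, 0 for padding.
    """
    # Broadcast the mask over the hidden dimension: (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    # Sum only the vectors of real tokens
    summed = (last_hidden_states * mask).sum(dim=1)
    # Number of real tokens per sentence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts
```

For the single, unpadded sentence in the example above, `torch.ones_like(input_ids)` can serve as the attention mask, so `mean_pool(last_hidden_states, torch.ones_like(input_ids))` yields a tensor of shape (1, hidden_size).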

Remember to always sanitize your text! Replace the s and t cedilla letters with their comma-below counterparts:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

This matters because the model was NOT trained on the cedilla forms of s and t. If you skip this step, performance will drop because of <UNK>s and a higher number of tokens per word.
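
The same replacement can be wrapped in a small reusable helper. This is a minimal sketch: the function name `sanitize` is hypothetical, and `str.translate` is used here as an equivalent of the chained `replace` calls so that all four characters are mapped in one pass. Unicode escapes make the two look-alike character sets explicit.

```python
# Map the legacy cedilla code points to the comma-below letters.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below (lowercase)
    "\u0163": "\u021b",  # t with cedilla -> t with comma below (lowercase)
    "\u015e": "\u0218",  # S with cedilla -> S with comma below (uppercase)
    "\u0162": "\u021a",  # T with cedilla -> T with comma below (uppercase)
})


def sanitize(text: str) -> str:
    """Return text with cedilla s/t letters replaced by comma-below letters."""
    return text.translate(_CEDILLA_TO_COMMA)


# Cedilla letters in the input are rewritten; other text is left untouched.
print(sanitize("Aceasta este o propozi\u0163ie \u015fi un test."))
```

Calling `sanitize` on every string before it reaches the tokenizer keeps the preprocessing in one place.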

Evaluation

Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given in the dedicated evaluation page.

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased because, at the time of writing, it was the only available BERT model that worked on Romanian.

| Model | UPOS | XPOS | NER | LAS |
|---|---|---|---|---|
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
| bert-base-romanian-uncased-v1 | 98.18 | 96.84 | 85.26 | 89.61 |
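
To make the comparison easier to read, the absolute gain over the baseline can be computed directly from the scores above. This is only an illustrative sketch: the numbers are copied from the table, and the helper name `absolute_gains` is hypothetical.

```python
# Scores copied from the evaluation table above.
scores = {
    "bert-base-multilingual-uncased": {"UPOS": 97.65, "XPOS": 95.72, "NER": 83.91, "LAS": 87.65},
    "bert-base-romanian-uncased-v1": {"UPOS": 98.18, "XPOS": 96.84, "NER": 85.26, "LAS": 89.61},
}


def absolute_gains(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for each task, rounded to 2 decimals."""
    return {task: round(candidate[task] - baseline[task], 2) for task in baseline}


gains = absolute_gains(
    scores["bert-base-multilingual-uncased"],
    scores["bert-base-romanian-uncased-v1"],
)
for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The Romanian model is ahead on all four tasks, with the largest difference on LAS (1.96) and the smallest on UPOS (0.53).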

Corpus

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |

Citation

If you use this model in a research paper, I'd kindly ask you to cite the following paper:

Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.

or, in bibtex:

@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}

Acknowledgements

  • We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!