RoBERTa Turkish medium Character-level (uncased)

Pretrained model on the Turkish language using a masked language modeling (MLM) objective. The model is uncased. The pretraining corpus is the Turkish split of OSCAR, further filtered and cleaned.

The model architecture is similar to bert-medium (8 layers, 8 attention heads, and a hidden size of 512). Tokenization is character-level, meaning that text is split into individual characters. The vocabulary size is 384.
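
For intuition, a minimal sketch of what character-level splitting means (illustrative only; the actual tokenizer setup used with this model is shown below):

    # Illustrative only: character-level tokenization treats each character as a unit.
    text = "merhaba dünya"
    print(list(text))
    # ['m', 'e', 'r', 'h', 'a', 'b', 'a', ' ', 'd', 'ü', 'n', 'y', 'a']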

Details and performance comparisons can be found in the following paper: https://arxiv.org/abs/2204.08832

Note that this model does not include a tokenizer file, because it reuses ByT5Tokenizer. The following code can be used for model loading and tokenization; the example maximum length (1024) can be changed:

    from transformers import AutoModel, ByT5Tokenizer

    model_path = "ctoraman/RoBERTa-TR-medium-char"
    model = AutoModel.from_pretrained(model_path)
    # For sequence classification:
    # from transformers import AutoModelForSequenceClassification
    # model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)

    # The checkpoint ships without a tokenizer file, so ByT5's tokenizer is reused.
    tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
    # Map the special tokens expected by the model onto ByT5's additional special tokens.
    tokenizer.mask_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][0]
    tokenizer.cls_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
    tokenizer.bos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
    tokenizer.sep_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
    tokenizer.eos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
    tokenizer.pad_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
    tokenizer.unk_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
    # Example maximum sequence length; adjust as needed.
    tokenizer.model_max_length = 1024
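
As a usage sketch (assuming the setup above), input text can be encoded with the tokenizer and passed through the model to obtain one hidden vector of size 512 per token:

    import torch

    # "merhaba dünya" is just an example input; any Turkish text works the same way.
    inputs = tokenizer("merhaba dünya", return_tensors="pt",
                       truncation=True, max_length=tokenizer.model_max_length)

    with torch.no_grad():
        outputs = model(**inputs)

    # (batch_size, sequence_length, hidden_size=512)
    print(outputs.last_hidden_state.shape)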

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2204.08832,
  doi = {10.48550/ARXIV.2204.08832},
  url = {https://arxiv.org/abs/2204.08832},
  author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}