Kaz-RoBERTa (base-sized model)
Model description
Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.
Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
#  ...]
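You can also load the model and tokenizer directly and score the <mask> position yourself. The snippet below is a minimal sketch, not part of the original card; it assumes the standard AutoTokenizer / AutoModelForMaskedLM interface from transformers and reuses the same model identifier.
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> text = "Мәтел тура, ауыспалы, астарлы <mask> қолданылады"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # locate the <mask> position and take the highest-scoring token
>>> mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
>>> tokenizer.decode(logits[0, mask_pos].argmax(dim=-1))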
Training data
The Kaz-RoBERTa model was pretrained on the union of two datasets:
- MDBKD (Multi-Domain Bilingual Kazakh Dataset): a Kazakh-language dataset containing over 24,883,808 unique texts from multiple domains.
- Conversational data: preprocessed dialogues between the customer support team of Beeline KZ (Veon Group) and its clients.
Together these datasets amount to 25 GB of text.
Training procedure
Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model are chunks of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked with <s> and its end with </s>.
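As a quick illustration (not from the original card; it assumes the released tokenizer exposes the usual RoBERTa special tokens), you can inspect how an input is wrapped:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> enc = tokenizer("Сәлем, әлем!")  # "Hello, world!"
>>> tokenizer.convert_ids_to_tokens(enc['input_ids'])
>>> # the returned token list starts with '<s>' and ends with '</s>'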
Pretraining
The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The masking probability for the MLM objective was 15%, and the architecture uses 12 attention heads (num_attention_heads=12) and 6 hidden layers (num_hidden_layers=6).
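For reference, these hyperparameters roughly correspond to the following transformers setup. This is an illustrative sketch rather than the actual training script: the configuration fields are standard, but the output directory, per-device batch-size split, and dataset wiring are assumptions.
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Architecture matching the card: 52k vocabulary, 512-token sequences, 6 layers, 12 heads
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,  # 512 tokens plus RoBERTa's two offset positions
    num_hidden_layers=6,
    num_attention_heads=12,
)
model = RobertaForMaskedLM(config=config)

tokenizer = RobertaTokenizerFast.from_pretrained('kz-transformers/kaz-roberta-conversational')

# 15% of input tokens are masked for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='kaz-roberta-mlm',        # hypothetical output directory
    per_device_train_batch_size=64,      # 2 GPUs x 64 = total batch size 128 (assumed split)
    max_steps=500_000,
)

# train_dataset would be the tokenized 25 GB corpus described above (not shown here)
# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=train_dataset)
# trainer.train()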
Contributions
Thanks to @BeksultanSagyndyk and @SanzharMrz for adding this model. Point of contact: Sanzhar Murzakhmetov, Beksultan Sagyndyk