TataKata: Indonesian BERT Language Model

TataKata is an Indonesian BERT model obtained by continued pretraining of the indobenchmark/indobert-base-p1 checkpoint. The model is designed to improve understanding of Indonesian grammar and word usage, aligning with the KBBI (Kamus Besar Bahasa Indonesia) and PUEBI (Pedoman Umum Ejaan Bahasa Indonesia) standards.


Model Overview

  • Model Name: citylighxts/TataKata
  • Language: Indonesian (id)
  • Base Model: indobenchmark/indobert-base-p1
  • Architecture: BERT-base (12-layer, 768 hidden, 12 attention heads, 110M parameters)
  • Task: Masked Language Modeling (MLM)
  • License: Apache-2.0

Usage Example

You can easily load and use the model with the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and masked-language-model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
model = AutoModelForMaskedLM.from_pretrained("citylighxts/TataKata")

# Fill-mask pipeline: predict the most likely replacements for [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
result = fill_mask("Saya pergi ke [MASK] sakit.")
print(result)

Example output:

[{'sequence': 'Saya pergi ke rumah sakit.', 'score': 0.88},
 {'sequence': 'Saya pergi ke klinik sakit.', 'score': 0.06},
 {'sequence': 'Saya pergi ke tempat sakit.', 'score': 0.03}]

Training Details

  • Objective: Continue pretraining IndoBERT on a larger, cleaner Indonesian corpus focused on proper grammar and contextual fluency (a minimal training sketch follows this list).
  • Datasets: Combination of KBBI definitions, PUEBI examples, Indonesian Wikipedia, and public news datasets.
  • Preprocessing: Text normalization, sentence segmentation, lowercase conversion.
  • Tokenizer: WordPiece tokenizer trained from scratch with 32K vocabulary size.
  • Max sequence length: 512 tokens.
  • Masked Language Modeling probability: 0.15
  • Training epochs: 3
  • Batch size: 16
  • Optimizer: AdamW with linear learning rate decay.
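
The setup above can be approximated with the Hugging Face Trainer API. The snippet below is a minimal sketch, not the exact training script: it reuses the base IndoBERT tokenizer rather than the 32K WordPiece tokenizer described above, uses a tiny in-memory stand-in corpus in place of the real training data, and "tatakata-mlm" is a hypothetical output directory.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the base checkpoint that continued pretraining resumes from.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")

# Tiny stand-in corpus; the real run uses the cleaned Indonesian corpus
# (KBBI definitions, PUEBI examples, Wikipedia, news) described above.
corpus = Dataset.from_dict({"text": [
    "Saya pergi ke rumah sakit.",
    "Dia sedang membaca buku di perpustakaan.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the 15% MLM probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="tatakata-mlm",       # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    lr_scheduler_type="linear",      # linear learning-rate decay
    optim="adamw_torch",             # AdamW optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()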

Evaluation

TataKata achieves lower (better) perplexity than the base IndoBERT on evaluation corpora derived from KBBI and Indonesian Wikipedia; detailed benchmarks will be provided in the paper or a future update.
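
Perplexity for a masked language model is commonly estimated as pseudo-perplexity: each token is masked in turn, the model scores the true token, and the averaged negative log-likelihood is exponentiated. The function below is a minimal sketch of that procedure, not the exact evaluation script behind the result above.

import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
model = AutoModelForMaskedLM.from_pretrained("citylighxts/TataKata")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    # Mask one token at a time and score the true token at that position.
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total_nll, count = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total_nll += -log_probs[input_ids[i]].item()
        count += 1
    return math.exp(total_nll / count)

print(pseudo_perplexity("Saya pergi ke rumah sakit."))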


Intended Use

This model is suitable for:

  • Grammar checking tasks in Indonesian.
  • Text completion and correction systems.
  • Language modeling for downstream NLP tasks such as text classification, QA, or summarization (see the fine-tuning sketch after this list).
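
For downstream use, the checkpoint can be loaded into a task-specific head and fine-tuned. A minimal sketch for sequence classification, assuming a hypothetical two-label task:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
# num_labels=2 is a hypothetical label count; the classification head is newly
# initialized and must be fine-tuned on labeled Indonesian data before use.
model = AutoModelForSequenceClassification.from_pretrained(
    "citylighxts/TataKata", num_labels=2
)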

Limitations

  • May underperform on informal or slang-heavy Indonesian texts.
  • Not optimized for code-switching or mixed-language sentences.
  • Requires additional fine-tuning for specific downstream tasks.

Citation

If you use this model, please cite:

@misc{tatakata2025,
  title={TataKata: Indonesian BERT Language Model Aligned with KBBI and PUEBI},
  author={Hana Azizah},
  year={2025},
  howpublished={\url{https://huggingface.co/citylighxts/TataKata}}
}

License

This model is licensed under the Apache License 2.0.


Contact

For questions or collaborations, please contact:

  • Author: Hana Azizah (citylighxts)
  • Email: citylighxts@example.com
  • Hugging Face: https://huggingface.co/citylighxts
