TataKata: Indonesian BERT Language Model

TataKata is an Indonesian BERT model obtained by continued pretraining of the indobenchmark/indobert-base-p1 checkpoint. The model is designed to improve understanding of Indonesian grammar and word usage, aligning with the KBBI (Kamus Besar Bahasa Indonesia) and PUEBI (Pedoman Umum Ejaan Bahasa Indonesia) standards.


Model Overview

  • Model Name: citylighxts/TataKata
  • Language: Indonesian (id)
  • Base Model: indobenchmark/indobert-base-p1
  • Architecture: BERT-base (12-layer, 768 hidden, 12 attention heads, 110M parameters)
  • Task: Masked Language Modeling (MLM)
  • License: Apache-2.0

Usage Example

You can easily load and use the model with the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and masked-language-model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
model = AutoModelForMaskedLM.from_pretrained("citylighxts/TataKata")

# Fill-mask pipeline: predict the most likely replacements for [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
result = fill_mask("Saya pergi ke [MASK] sakit.")
print(result)

Example output:

[{'sequence': 'Saya pergi ke rumah sakit.', 'score': 0.88},
 {'sequence': 'Saya pergi ke klinik sakit.', 'score': 0.06},
 {'sequence': 'Saya pergi ke tempat sakit.', 'score': 0.03}]

Training Details

  • Objective: Continue pretraining IndoBERT on a larger, cleaner Indonesian corpus focused on proper grammar and contextual fluency (a minimal training sketch follows this list).
  • Datasets: Combination of KBBI definitions, PUEBI examples, Indonesian Wikipedia, and public news datasets.
  • Preprocessing: Text normalization, sentence segmentation, lowercase conversion.
  • Tokenizer: WordPiece tokenizer trained from scratch with 32K vocabulary size.
  • Max sequence length: 512 tokens.
  • Masked Language Modeling probability: 0.15
  • Training epochs: 3
  • Batch size: 16
  • Optimizer: AdamW with linear learning rate decay.
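
The setup above can be approximated with the Hugging Face Trainer API. The snippet below is a minimal sketch, not the exact training script: it reuses the base IndoBERT tokenizer rather than the 32K WordPiece tokenizer described above, uses a tiny in-memory stand-in corpus in place of the real training data, and "tatakata-mlm" is a hypothetical output directory.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the base checkpoint that continued pretraining resumes from.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")

# Tiny stand-in corpus; the real run uses the cleaned Indonesian corpus
# (KBBI definitions, PUEBI examples, Wikipedia, news) described above.
corpus = Dataset.from_dict({"text": [
    "Saya pergi ke rumah sakit.",
    "Dia sedang membaca buku di perpustakaan.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the 15% MLM probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="tatakata-mlm",       # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    lr_scheduler_type="linear",      # linear learning-rate decay
    optim="adamw_torch",             # AdamW optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()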

Evaluation

TataKata achieves lower (better) perplexity than the base IndoBERT on evaluation corpora derived from KBBI and Indonesian Wikipedia; detailed benchmarks will be provided in the paper or a future update.
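
Perplexity for a masked language model is commonly estimated as pseudo-perplexity: each token is masked in turn, the model scores the true token, and the averaged negative log-likelihood is exponentiated. The function below is a minimal sketch of that procedure, not the exact evaluation script behind the result above.

import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
model = AutoModelForMaskedLM.from_pretrained("citylighxts/TataKata")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    # Mask one token at a time and score the true token at that position.
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total_nll, count = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total_nll += -log_probs[input_ids[i]].item()
        count += 1
    return math.exp(total_nll / count)

print(pseudo_perplexity("Saya pergi ke rumah sakit."))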


Intended Use

This model is suitable for:

  • Grammar checking tasks in Indonesian.
  • Text completion and correction systems.
  • Language modeling for downstream NLP tasks such as text classification, QA, or summarization (see the fine-tuning sketch after this list).
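
For downstream use, the checkpoint can be loaded into a task-specific head and fine-tuned. A minimal sketch for sequence classification, assuming a hypothetical two-label task:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
# num_labels=2 is a hypothetical label count; the classification head is newly
# initialized and must be fine-tuned on labeled Indonesian data before use.
model = AutoModelForSequenceClassification.from_pretrained(
    "citylighxts/TataKata", num_labels=2
)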

Limitations

  • May underperform on informal or slang-heavy Indonesian texts.
  • Not optimized for code-switching or mixed-language sentences.
  • Requires additional fine-tuning for specific downstream tasks.

Citation

If you use this model, please cite:

@misc{tatakata2025,
  title={TataKata: Indonesian BERT Language Model Aligned with KBBI and PUEBI},
  author={Hana Azizah},
  year={2025},
  howpublished={\url{https://huggingface.co/citylighxts/TataKata}}
}

License

This model is licensed under the Apache License 2.0.


Contact

For questions or collaborations, please contact:

  • Author: Hana Azizah (citylighxts)
  • Email: citylighxts@example.com
  • Hugging Face: https://huggingface.co/citylighxts
