TataKata: Indonesian BERT Language Model
TataKata is an Indonesian BERT model trained through continued pretraining of the original IndoBERT base architecture. The model is designed to enhance understanding of Indonesian grammar and word usage, aligning with KBBI (Kamus Besar Bahasa Indonesia) and PUEBI (Pedoman Umum Ejaan Bahasa Indonesia) standards.
Model Overview
- Model Name: citylighxts/TataKata
- Language: Indonesian (id)
- Base Model: indobenchmark/indobert-base-p1
- Architecture: BERT-base (12-layer, 768 hidden, 12 attention heads, 110M parameters)
- Task: Masked Language Modeling (MLM)
- License: Apache-2.0
Usage Example
You can easily load and use the model with the Hugging Face Transformers library:
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the masked-language-modeling head from the Hub.
tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
model = AutoModelForMaskedLM.from_pretrained("citylighxts/TataKata")

# Use the fill-mask pipeline to predict the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
result = fill_mask("Saya pergi ke [MASK] sakit.")
print(result)
Example output:
[{'sequence': 'Saya pergi ke rumah sakit.', 'score': 0.88},
{'sequence': 'Saya pergi ke klinik sakit.', 'score': 0.06},
{'sequence': 'Saya pergi ke tempat sakit.', 'score': 0.03}]
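If you prefer not to use the pipeline helper, the same predictions can be read directly from the model outputs. This is a minimal sketch reusing the tokenizer and model loaded above; it assumes the standard BERT-style [MASK] token exposed by the tokenizer.

import torch

# Build the input with the tokenizer's own mask token to stay robust to its configuration.
text = f"Saya pergi ke {tokenizer.mask_token} sakit."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the top-5 candidate tokens.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_pos].softmax(dim=-1), k=5)
for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(tokenizer.decode(int(token_id)), float(score))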
Training Details
- Objective: Continue pretraining BERT on a larger and cleaner Indonesian corpus focusing on proper grammar and contextual fluency.
- Datasets: Combination of KBBI definitions, PUEBI examples, Indonesian Wikipedia, and public news datasets.
- Preprocessing: Text normalization, sentence segmentation, lowercase conversion.
- Tokenizer: WordPiece tokenizer trained from scratch with 32K vocabulary size.
- Max sequence length: 512 tokens.
- Masked Language Modeling probability: 0.15
- Training epochs: 3
- Batch size: 16
- Optimizer: AdamW with linear learning rate decay (a Trainer-style sketch of these settings follows below).
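The exact training script is not part of this card. As an illustration only, the hyperparameters above map onto a standard Hugging Face Trainer setup for masked language modeling roughly as follows. The corpus file name is a placeholder, and for simplicity the sketch reuses the base checkpoint's tokenizer; the new 32K WordPiece tokenizer described above would additionally require resizing the model's embedding matrix.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")

# "corpus.txt" is a placeholder for the combined KBBI/PUEBI/Wikipedia/news corpus.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the 15% MLM probability listed above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tatakata-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    lr_scheduler_type="linear",  # AdamW with linear decay is the Trainer default
    learning_rate=5e-5,          # assumption: the learning rate is not stated in the card
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()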
Evaluation
TataKata achieves better perplexity than the base IndoBERT on evaluation corpora derived from KBBI and Indonesian Wikipedia, with a self-reported perplexity of 12.40 on a combined Indonesian Wikipedia + KBBI set. A more detailed benchmark will be provided in a future update.
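The evaluation protocol is not spelled out in this card. For masked language models, perplexity is commonly approximated by pseudo-perplexity: each token is masked in turn and scored under the model. The following sketch assumes that setup and reuses the tokenizer and model loaded in the usage example above.

import torch

def pseudo_perplexity(text):
    # Tokenize once; [CLS] and [SEP] are kept in the input but never masked.
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    # Exponentiate the mean negative log-likelihood over all masked positions.
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))

print(pseudo_perplexity("Saya pergi ke rumah sakit."))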
Intended Use
This model is suitable for:
- Grammar checking tasks in Indonesian.
- Text completion and correction systems.
- Language modeling for downstream NLP tasks such as text classification, question answering, or summarization (see the fine-tuning sketch below).
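As a brief illustration of the last point, the checkpoint can be loaded with a task-specific head and fine-tuned like any BERT-base model. The label count below is a placeholder, not part of this release.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("citylighxts/TataKata")
# num_labels is task-specific; 2 is shown here purely as an example.
model = AutoModelForSequenceClassification.from_pretrained("citylighxts/TataKata", num_labels=2)
# From here, fine-tune with Trainer or a custom training loop on your labelled data.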
Limitations
- May underperform on informal or slang-heavy Indonesian texts.
- Not optimized for code-switching or mixed-language sentences.
- Requires additional fine-tuning for specific downstream tasks.
Citation
If you use this model, please cite:
@misc{tatakata2025,
  title={TataKata: Indonesian BERT Language Model Aligned with KBBI and PUEBI},
  author={Hana Azizah},
  year={2025},
  howpublished={\url{https://huggingface.co/citylighxts/TataKata}}
}
License
This model is licensed under the Apache License 2.0.
Contact
For questions or collaborations, please contact:
- Author: Hana Azizah (citylighxts)
- Email: citylighxts@example.com
- Hugging Face: https://huggingface.co/citylighxts