# Kyrgyz Punctuation Restoration with XLM-RoBERTa

The first punctuation restoration model for the Kyrgyz language, achieving 94.1% precision and a 90.3% F1-score, surpassing results reported for other low-resource languages.

Published research: "AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language" by Uvalieva Z. and Muhametjanova G. (Scopus-indexed).
## Highlights

- F1-score of 90.3%, outperforming comparable low-resource language models
- First of its kind for Kyrgyz (Turkic language family, ~7M speakers)
- ONNX export for fast inference across frameworks
- Designed for ASR post-processing: restores punctuation in speech-to-text output
## Performance
| Metric | Score |
|---|---|
| Precision | 94.1% |
| Recall | 86.8% |
| F1-Score | 90.3% |
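F1 is the harmonic mean of precision and recall; as a check, 2 × (0.941 × 0.868) / (0.941 + 0.868) ≈ 0.903, consistent with the table.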
### Cross-Lingual Comparison
| Model | Language | F1-Score |
|---|---|---|
| Ours (XLM-RoBERTa) | Kyrgyz | 90.3% |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |
The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
## Model Architecture
| Parameter | Value |
|---|---|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |
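The repository ships the model already exported to ONNX. For reference, a minimal sketch of how such an export could be produced; the 5-label head, input names, and opset version are assumptions, not the authors' exact script:

```python
import torch
from transformers import AutoModelForTokenClassification

# Hypothetical reconstruction: a 5-label token-classification head
# on xlm-roberta-base, traced and saved as ONNX with dynamic axes.
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=5)
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects
model.eval()

# Dummy batch for tracing (batch of 1, sequence length 16)
input_ids = torch.ones(1, 16, dtype=torch.long)
attention_mask = torch.ones(1, 16, dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    # Dynamic axes let inference accept variable batch sizes and sequence lengths
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```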
## Training Details

### Dataset

A custom-built 200 MB Kyrgyz text corpus, collected over two months:
| Source | Size | Description |
|---|---|---|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |
Preprocessing pipeline: PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
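A hedged sketch of what the final labeling step could look like: punctuated text in, (token, label) pairs out, where each word's trailing punctuation mark becomes its label. The label names match the inference section below; the exact JSON schema is an assumption.

```python
import json

# Map trailing punctuation to the model's label set; everything else is "O".
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    examples = []
    for raw in sentence.split():
        token = raw.rstrip(",.?!")         # word without trailing punctuation
        trailing = raw[len(token):]        # whatever was stripped, if anything
        label = PUNCT_LABELS.get(trailing[:1], "O") if trailing else "O"
        if token:
            examples.append({"token": token, "label": label})
    return examples

# "This is a text in the Kyrgyz language."
print(json.dumps(label_sentence("Бул кыргыз тилиндеги текст."), ensure_ascii=False))
```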
### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- Back-translation: Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- Token-level modifications: random insertions, deletions, and swaps (sketched after this list)
- Morphological transformations: case-form and morpheme modifications that preserve grammatical correctness
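A minimal sketch of the token-level modifications; the 10% rate and the duplicate-a-token insertion strategy are assumptions:

```python
import random

def augment_tokens(tokens, rate=0.1, seed=None):
    """Apply random insertions, deletions, and swaps to a token list."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(max(1, int(len(out) * rate))):
        op = rng.choice(["insert", "delete", "swap"])
        if op == "insert":
            # Insert a copy of a random token at a random position
            out.insert(rng.randrange(len(out) + 1), rng.choice(tokens))
        elif op == "delete" and len(out) > 1:
            del out[rng.randrange(len(out))]
        elif op == "swap" and len(out) > 1:
            i = rng.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(augment_tokens("бул кыргыз тилиндеги текст".split(), seed=0))
```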
### Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
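For orientation, the table maps onto Hugging Face `TrainingArguments` roughly as below. This is a sketch, not the authors' published training script; note that Transformers' default optimizer is AdamW, a close variant of Adam.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kyrgyz-punctuation",   # hypothetical output path
    per_device_train_batch_size=32,    # Batch size: 32
    num_train_epochs=10,               # Epochs: 10
    learning_rate=5e-5,                # Learning rate: 5e-5
    # Regularization comes from XLM-R's built-in dropout, so no extra flag here
)
```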
## How to Use

A minimal inference sketch. The exact tokenizer settings and ONNX input names live in `config.yaml` and `main.py`; the snippet below assumes the standard `xlm-roberta-base` tokenizer and `input_ids`/`attention_mask` input names.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts a punctuation label for each token:
labels = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION"]

# Example inference ("this is a text in the Kyrgyz language")
input_text = "бул кыргыз тилиндеги текст"
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # see config.yaml
encoded = tokenizer(input_text, return_tensors="np")

# Input names are assumed here; see main.py for the full pipeline
logits = session.run(None, {"input_ids": encoded["input_ids"],
                            "attention_mask": encoded["attention_mask"]})[0]
predictions = [labels[i] for i in np.argmax(logits[0], axis=-1)]
```
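To turn the per-token labels back into punctuated text, one option is to keep the label predicted on each word's last subword piece. Treating the last subword as decisive is an assumption; `word_ids()` is available because `AutoTokenizer` loads a fast tokenizer.

```python
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

# For each whitespace-separated word, keep the label of its last subword
word_label = {}
for pos, wid in enumerate(encoded.word_ids()):
    if wid is not None:              # skip special tokens
        word_label[wid] = predictions[pos]

restored = " ".join(word + PUNCT.get(word_label.get(i, "O"), "")
                    for i, word in enumerate(input_text.split()))
print(restored)
```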
## Repository Structure

```text
├── model.onnx          # Trained model in ONNX format (1.11 GB)
├── main.py             # Inference pipeline
├── env.py              # Environment configuration
├── config.yaml         # Hyperparameters and model config
├── requirements.txt    # Python dependencies
└── Files/              # Additional model files
```
## Intended Use
| Use Case | Description |
|---|---|
| ASR post-processing | Restore punctuation in speech-to-text output for Kyrgyz |
| Text normalization | Clean and format raw Kyrgyz text with proper punctuation |
| NLP preprocessing | Improve downstream task performance (NER, MT, summarization) |
| Accessibility | Enhance readability of automatically generated Kyrgyz content |
## Limitations
- Rare punctuation marks: Lower accuracy on question marks and exclamation points due to class imbalance in training data
- Formal text bias: Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- Morpheme boundary errors: Occasional difficulty placing punctuation in complex agglutinative constructions
- Domain specificity: Best performance on prose-style text; specialized domains may require additional fine-tuning
## Future Directions
- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing
## Citation

```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```
## Author

Zarina Uvalieva, ML Engineer specializing in NLP and speech technologies for low-resource languages.

- HuggingFace
- zarina.uvalievaa@gmail.com