Kyrgyz Punctuation Restoration with XLM-RoBERTa

The first punctuation restoration model for the Kyrgyz language, achieving 94.1% precision and a 90.3% F1-score, surpassing reported results for other low-resource languages.

📄 Published research: "AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language" by Uvalieva Z. and Muhametjanova G. (Scopus-indexed)


Highlights

  • πŸ† F1-score: 90.3% β€” outperforms comparable low-resource language models
  • 🌍 First-of-its-kind for Kyrgyz (Turkic language family, ~7M speakers)
  • ⚑ ONNX format β€” optimized for fast inference across frameworks
  • πŸŽ™οΈ ASR post-processing β€” designed to restore punctuation in speech-to-text output

Performance

| Metric    | Score |
|-----------|-------|
| Precision | 94.1% |
| Recall    | 86.8% |
| F1-score  | 90.3% |
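
F1 is the harmonic mean of precision and recall, and the figures above are internally consistent: 2 × 0.941 × 0.868 / (0.941 + 0.868) ≈ 0.903.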

Cross-Lingual Comparison

| Model              | Language        | F1-score |
|--------------------|-----------------|----------|
| Ours (XLM-RoBERTa) | Kyrgyz          | 90.3%    |
| Alam et al. (2020) | English (clean) | 87.0%    |
| Alam et al. (2020) | Bangla          | 69.5%    |
| Nagy et al. (2021) | Hungarian       | ~82.0%   |

The model performs strongly on frequent punctuation marks (periods, commas) but shows reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance in the training data.


Model Architecture

| Parameter          | Value            |
|--------------------|------------------|
| Base model         | XLM-RoBERTa-base |
| Parameters         | ~270M            |
| Transformer layers | 12               |
| Hidden dimensions  | 768              |
| Attention heads    | 12               |
| Export format      | ONNX             |
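
As a rough illustration of this setup, the sketch below builds an equivalent five-label token-classification head on XLM-RoBERTa-base and exports it to ONNX. This is an assumption-laden reconstruction (head initialization, input names, and opset are not from the paper), not the authors' export code.

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Five labels: O, COMMA, PERIOD, QUESTION, EXCLAMATION. The head here is freshly
# initialized; this mirrors the architecture only, not the trained weights.
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=5)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
example = tokenizer("бул кыргыз тилиндеги текст", return_tensors="pt")

torch.onnx.export(
    model,
    (example["input_ids"], example["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)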

Training Details

Dataset

A custom-built 200 MB Kyrgyz text corpus, collected over 2 months:

| Source                                  | Size   | Description                       |
|-----------------------------------------|--------|-----------------------------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia                        | 40 MB  | Encyclopedia articles             |
| News portals                            | 25 MB  | Journalistic text                 |

Preprocessing pipeline: PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
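
The paper does not publish the JSON schema; a minimal sketch of the final labeling step, with assumed field names and an assumed output path, could look like this:

import json

# Map trailing punctuation to the model's label set; unpunctuated words get "O"
LABEL_FOR = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    """Pair each word of a cleaned sentence with a punctuation label."""
    records = []
    for token in sentence.split():
        word = token.rstrip(",.?!")
        trailing = token[len(word):]
        records.append({"word": word, "label": LABEL_FOR.get(trailing[:1], "O")})
    return records

# corpus.jsonl is a hypothetical output path
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for sentence in ["бул кыргыз тилиндеги текст."]:
        f.write(json.dumps(label_sentence(sentence), ensure_ascii=False) + "\n")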

Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

  • Back-translation: Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
  • Token-level modifications: Random insertions, deletions, and swaps (see the sketch after this list)
  • Morphological transformations: Case form and morpheme modifications preserving grammatical correctness
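
A toy version of the token-level step, with assumed modification rates:

import random

def token_augment(tokens, p=0.1, seed=None):
    """Randomly delete, duplicate, or swap tokens, each with probability p."""
    rng = random.Random(seed)
    out = list(tokens)
    i = 0
    while i < len(out):
        r = rng.random()
        if r < p and len(out) > 1:
            del out[i]                                # random deletion
            continue
        if r < 2 * p:
            out.insert(i, out[i])                     # random insertion (duplicate)
            i += 1
        elif r < 3 * p and i + 1 < len(out):
            out[i], out[i + 1] = out[i + 1], out[i]   # swap adjacent tokens
        i += 1
    return out

print(token_augment("бул кыргыз тилиндеги текст".split(), p=0.15, seed=0))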

Hyperparameters

| Parameter      | Value            |
|----------------|------------------|
| Batch size     | 32               |
| Epochs         | 10               |
| Optimizer      | Adam             |
| Learning rate  | 5e-5             |
| Regularization | Dropout          |
| Hardware       | Google Colab TPU |
| Training time  | 42 hours         |
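
For reference, the same configuration expressed with the Hugging Face Trainer API, as a sketch only: the output directory is hypothetical, Trainer defaults to AdamW rather than plain Adam, and dropout is set in the XLM-R model config rather than here.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlmr-ky-punct",      # hypothetical output directory
    per_device_train_batch_size=32,  # batch size 32
    num_train_epochs=10,             # 10 epochs
    learning_rate=5e-5,              # matches the reported learning rate
)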

How to Use

A minimal inference sketch follows; the exact input names are read from the exported graph, and the full pipeline (tokenizer settings, label handling) lives in main.py and config.yaml. The label order below follows the list in this README and should be verified against config.yaml.

import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model and the base tokenizer (see config.yaml for exact settings)
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# The model predicts a punctuation label for each token
LABELS = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION"]

# Example inference
input_text = "бул кыргыз тилиндеги текст"  # "this is a text in the Kyrgyz language"
enc = tokenizer(input_text, return_tensors="np")

# Feed whichever inputs the exported graph expects, then take the top label per token
inputs = {inp.name: enc[inp.name] for inp in session.get_inputs()}
logits = session.run(None, inputs)[0]
predictions = [LABELS[i] for i in logits[0].argmax(axis=-1)]
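
To turn token labels back into punctuated text, one minimal approach (continuing the sketch above, and assuming the fast tokenizer's whitespace word alignment) keeps the label of each word's first subword:

# Reattach predicted punctuation, using each word's first-subword label
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

word_label = {}
for idx, wid in enumerate(enc.word_ids(0)):
    if wid is not None and wid not in word_label:
        word_label[wid] = predictions[idx]

words = input_text.split()
restored = " ".join(w + PUNCT.get(word_label.get(i, "O"), "")
                    for i, w in enumerate(words))
print(restored)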

Repository Structure

├── model.onnx           # Trained model in ONNX format (1.11 GB)
├── main.py              # Inference pipeline
├── env.py               # Environment configuration
├── config.yaml          # Hyperparameters and model config
├── requirements.txt     # Python dependencies
└── Files/               # Additional model files

Intended Use

| Use case            | Description                                                    |
|---------------------|----------------------------------------------------------------|
| ASR post-processing | Restore punctuation in speech-to-text output for Kyrgyz        |
| Text normalization  | Clean and format raw Kyrgyz text with proper punctuation       |
| NLP preprocessing   | Improve downstream task performance (NER, MT, summarization)   |
| Accessibility       | Enhance readability of automatically generated Kyrgyz content  |

Limitations

  • Rare punctuation marks: Lower accuracy on question marks and exclamation points due to class imbalance in training data
  • Formal text bias: Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
  • Morpheme boundary errors: Occasional difficulty placing punctuation in complex agglutinative constructions
  • Domain specificity: Best performance on prose-style text; specialized domains may require additional fine-tuning

Future Directions

  • Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
  • Morphology-aware tokenization to replace standard BPE
  • Expanded dataset with informal and conversational Kyrgyz text
  • Integration with Kyrgyz ASR systems for end-to-end speech processing

Citation

@misc{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}

Author

Zarina Uvalieva β€” ML Engineer specializing in NLP and Speech Technologies for low-resource languages.
