# Kyrgyz Punctuation Restoration with XLM-RoBERTa

The first punctuation restoration model for the Kyrgyz language, achieving 94.1% precision and a 90.3% F1-score, surpassing results reported for other low-resource languages.

Published research: "AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language" by Uvalieva Z. and Muhametjanova G. (Scopus-indexed).
## Highlights

- F1-score of 90.3%, outperforming comparable low-resource language models
- First of its kind for Kyrgyz (Turkic language family, ~7M speakers)
- ONNX export for fast inference across frameworks
- Designed for ASR post-processing: restores punctuation in speech-to-text output
## Performance
| Metric | Score |
|---|---|
| Precision | 94.1% |
| Recall | 86.8% |
| F1-Score | 90.3% |
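F1 is the harmonic mean of precision and recall; as a check, 2 × (0.941 × 0.868) / (0.941 + 0.868) ≈ 0.903, consistent with the table.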
### Cross-Lingual Comparison
| Model | Language | F1-Score |
|---|---|---|
| Ours (XLM-RoBERTa) | Kyrgyz | 90.3% |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |
The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
## Model Architecture
| Parameter | Value |
|---|---|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |
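The repository ships the model already exported to ONNX. For reference, a minimal sketch of how such an export could be produced; the 5-label head, input names, and opset version are assumptions, not the authors' exact script:

```python
import torch
from transformers import AutoModelForTokenClassification

# Hypothetical reconstruction: a 5-label token-classification head
# on xlm-roberta-base, traced and saved as ONNX with dynamic axes.
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=5)
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects
model.eval()

# Dummy batch for tracing (batch of 1, sequence length 16)
input_ids = torch.ones(1, 16, dtype=torch.long)
attention_mask = torch.ones(1, 16, dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    # Dynamic axes let inference accept variable batch sizes and sequence lengths
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```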
## Training Details

### Dataset

A custom-built 200 MB Kyrgyz text corpus, collected over two months:
| Source | Size | Description |
|---|---|---|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |
Preprocessing pipeline: PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
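A hedged sketch of what the final labeling step could look like: punctuated text in, (token, label) pairs out, where each word's trailing punctuation mark becomes its label. The label names match the inference section below; the exact JSON schema is an assumption.

```python
import json

# Map trailing punctuation to the model's label set; everything else is "O".
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    examples = []
    for raw in sentence.split():
        token = raw.rstrip(",.?!")         # word without trailing punctuation
        trailing = raw[len(token):]        # whatever was stripped, if anything
        label = PUNCT_LABELS.get(trailing[:1], "O") if trailing else "O"
        if token:
            examples.append({"token": token, "label": label})
    return examples

# "This is a text in the Kyrgyz language."
print(json.dumps(label_sentence("Бул кыргыз тилиндеги текст."), ensure_ascii=False))
```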
### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- Back-translation: Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- Token-level modifications: random insertions, deletions, and swaps (sketched after this list)
- Morphological transformations: case-form and morpheme modifications that preserve grammatical correctness
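A minimal sketch of the token-level modifications; the 10% rate and the duplicate-a-token insertion strategy are assumptions:

```python
import random

def augment_tokens(tokens, rate=0.1, seed=None):
    """Apply random insertions, deletions, and swaps to a token list."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(max(1, int(len(out) * rate))):
        op = rng.choice(["insert", "delete", "swap"])
        if op == "insert":
            # Insert a copy of a random token at a random position
            out.insert(rng.randrange(len(out) + 1), rng.choice(tokens))
        elif op == "delete" and len(out) > 1:
            del out[rng.randrange(len(out))]
        elif op == "swap" and len(out) > 1:
            i = rng.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(augment_tokens("бул кыргыз тилиндеги текст".split(), seed=0))
```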
### Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
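For orientation, the table maps onto Hugging Face `TrainingArguments` roughly as below. This is a sketch, not the authors' published training script; note that Transformers' default optimizer is AdamW, a close variant of Adam.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kyrgyz-punctuation",   # hypothetical output path
    per_device_train_batch_size=32,    # Batch size: 32
    num_train_epochs=10,               # Epochs: 10
    learning_rate=5e-5,                # Learning rate: 5e-5
    # Regularization comes from XLM-R's built-in dropout, so no extra flag here
)
```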
## How to Use

A minimal inference sketch. The exact tokenizer settings and ONNX input names live in `config.yaml` and `main.py`; the snippet below assumes the standard `xlm-roberta-base` tokenizer and `input_ids`/`attention_mask` input names.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts a punctuation label for each token:
labels = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION"]

# Example inference ("this is a text in the Kyrgyz language")
input_text = "бул кыргыз тилиндеги текст"
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # see config.yaml
encoded = tokenizer(input_text, return_tensors="np")

# Input names are assumed here; see main.py for the full pipeline
logits = session.run(None, {"input_ids": encoded["input_ids"],
                            "attention_mask": encoded["attention_mask"]})[0]
predictions = [labels[i] for i in np.argmax(logits[0], axis=-1)]
```
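To turn the per-token labels back into punctuated text, one option is to keep the label predicted on each word's last subword piece. Treating the last subword as decisive is an assumption; `word_ids()` is available because `AutoTokenizer` loads a fast tokenizer.

```python
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

# For each whitespace-separated word, keep the label of its last subword
word_label = {}
for pos, wid in enumerate(encoded.word_ids()):
    if wid is not None:              # skip special tokens
        word_label[wid] = predictions[pos]

restored = " ".join(word + PUNCT.get(word_label.get(i, "O"), "")
                    for i, word in enumerate(input_text.split()))
print(restored)
```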
## Repository Structure

```text
├── model.onnx          # Trained model in ONNX format (1.11 GB)
├── main.py             # Inference pipeline
├── env.py              # Environment configuration
├── config.yaml         # Hyperparameters and model config
├── requirements.txt    # Python dependencies
└── Files/              # Additional model files
```
## Intended Use
| Use Case | Description |
|---|---|
| ASR post-processing | Restore punctuation in speech-to-text output for Kyrgyz |
| Text normalization | Clean and format raw Kyrgyz text with proper punctuation |
| NLP preprocessing | Improve downstream task performance (NER, MT, summarization) |
| Accessibility | Enhance readability of automatically generated Kyrgyz content |
## Limitations
- Rare punctuation marks: Lower accuracy on question marks and exclamation points due to class imbalance in training data
- Formal text bias: Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- Morpheme boundary errors: Occasional difficulty placing punctuation in complex agglutinative constructions
- Domain specificity: Best performance on prose-style text; specialized domains may require additional fine-tuning
## Future Directions
- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing
## Citation

```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```
## Author

Zarina Uvalieva, ML Engineer specializing in NLP and speech technologies for low-resource languages.

- HuggingFace
- zarina.uvalievaa@gmail.com