KHLR: Kurdish Handwritten Line Recognition

A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation

This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.


Repository Structure

KHLR/
├── Kurdish-HLR-Model/               # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/                # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/                  # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                     # Main training script
│   ├── synthetic_line_generator.py  # Recipe-based synthetic line generation
│   └── inference.py                 # Single image / batch inference
├── Sample/
│   ├── sample_image.tif             # Example handwritten line image
│   └── sample_image.txt             # Corresponding ground truth
├── requirements.txt
└── README.md

Architecture

| Component | Details |
| --- | --- |
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |
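
As a sanity check on the reported total, the figures above can be turned into a rough parameter estimate. The sketch below is not taken from `train.py`; it assumes PyTorch-style layers with biases, the torchvision DenseNet-121 feature extractor (classifier removed), a linear projection from the 1024 backbone channels to the hidden size, and untied input/output token layers over the 114-token Kurdish vocabulary. All function names are hypothetical.

```python
# Rough parameter count for the configuration listed above.
# Everything beyond the listed sizes is an assumption, not the repository's code.

D_MODEL, FFN_DIM, N_ENC, N_DEC, VOCAB = 256, 1024, 3, 3, 114
DENSENET121_FEATURES = 6_953_856  # torchvision densenet121 minus its classifier (assumed backbone)
BACKBONE_CHANNELS = 1024          # output channels of the DenseNet-121 feature extractor


def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def attention(d):
    """Q, K, V and output projections, each d x d with a bias."""
    return 4 * linear(d, d)


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def encoder_layer(d, ffn):
    feed_forward = linear(d, ffn) + linear(ffn, d)
    return attention(d) + feed_forward + 2 * layer_norm(d)


def decoder_layer(d, ffn):
    # Self-attention plus cross-attention, then the feed-forward block.
    feed_forward = linear(d, ffn) + linear(ffn, d)
    return 2 * attention(d) + feed_forward + 3 * layer_norm(d)


def estimate_total():
    transformer = (N_ENC * encoder_layer(D_MODEL, FFN_DIM)
                   + N_DEC * decoder_layer(D_MODEL, FFN_DIM))
    projection = linear(BACKBONE_CHANNELS, D_MODEL)
    token_layers = VOCAB * D_MODEL + linear(D_MODEL, VOCAB)
    return DENSENET121_FEATURES + transformer + projection + token_layers


if __name__ == "__main__":
    print(f"estimated parameters: {estimate_total() / 1e6:.2f}M")
```

Under these assumptions the estimate lands close to the reported ~12.8M. The number of attention heads does not change the count, since the heads split the same projection matrices.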

Performance

Kurdish (DASTNUS)

| Configuration | CER | WER | CRR (%) |
| --- | --- | --- | --- |
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

Cross-Dataset Generalization

| Dataset | Language | CER | WER | CRR (%) |
| --- | --- | --- | --- | --- |
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
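
In every row above, the CRR column equals 100 × (1 − CER). The sketch below shows one common way to compute these metrics from reference and predicted lines using Levenshtein distance. It is an illustration, not the repository's evaluation code: the function names are hypothetical, and aggregating errors over the whole corpus (rather than averaging per line) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: character edits divided by reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word error rate: word edits divided by reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def crr(cer_value):
    """Character recognition rate in percent, derived from CER."""
    return 100.0 * (1.0 - cer_value)


if __name__ == "__main__":
    refs = ["abcd efgh"]
    hyps = ["abed efgh"]
    print(cer(refs, hyps), wer(refs, hyps), crr(cer(refs, hyps)))
```

Because a single wrong character makes the whole word wrong, WER is typically several times larger than CER, as in the tables above.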

Installation

git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt

Quick Start

Inference

# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

Training

# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50

Synthetic Line Generation

python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
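
The generator takes separate writer lists for the training, validation, and testing splits. A plausible reason is to keep the splits writer-disjoint, so that synthetic lines never mix handwriting across splits. The sketch below checks that property before generation. It assumes each list file holds one writer ID per line, which the repository does not document, and both helper names are hypothetical.

```python
from pathlib import Path


def read_writer_ids(path):
    """Return the set of non-empty, stripped lines in a writer list file.

    Assumes one writer ID per line (file format not confirmed by the repo).
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def check_writer_disjoint(train, val, test):
    """Raise ValueError if any writer ID appears in more than one split."""
    overlaps = {
        "train/val": train & val,
        "train/test": train & test,
        "val/test": val & test,
    }
    leaked = {name: sorted(ids) for name, ids in overlaps.items() if ids}
    if leaked:
        raise ValueError(f"writer leakage between splits: {leaked}")


if __name__ == "__main__":
    splits = [read_writer_ids(f"./writers/{name}.txt")
              for name in ("Training", "Validation", "Testing")]
    check_writer_disjoint(*splits)
    print("writer splits are disjoint")
```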

Models

| Model | Language | Vocabulary | Format |
| --- | --- | --- | --- |
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

The Arabic and Urdu models share a single unified vocabulary covering all three languages (Kurdish, Arabic, and Urdu), which enables cross-script transfer learning.
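
One way such a unified vocabulary could be built is to extend the Kurdish token-to-ID mapping with any tokens that appear only in the Arabic or Urdu data, leaving the existing Kurdish IDs unchanged so that pretrained embedding rows stay aligned. The sketch below illustrates that idea only. The structure of `vocab.json`, the special token shown, and the function name are assumptions; the repository may construct its vocabulary differently.

```python
def build_unified_vocab(base_vocab, *extra_token_sets):
    """Extend base_vocab (token -> id) with unseen tokens, keeping existing ids.

    Assumes base ids are contiguous, starting at 0, so that new ids can be
    appended at len(vocab).
    """
    if sorted(base_vocab.values()) != list(range(len(base_vocab))):
        raise ValueError("base vocabulary ids must be contiguous from 0")
    unified = dict(base_vocab)
    for tokens in extra_token_sets:
        for token in sorted(tokens):  # sorted for a reproducible id order
            if token not in unified:
                unified[token] = len(unified)
    return unified


if __name__ == "__main__":
    # Toy example; the real vocabularies hold 114 and 192 tokens.
    kurdish = {"<pad>": 0, "ا": 1, "ب": 2, "ڕ": 3}
    arabic_chars = {"ا", "ب", "ث"}
    urdu_chars = {"ب", "ث", "ٹ"}
    unified = build_unified_vocab(kurdish, arabic_chars, urdu_chars)
    print(len(unified), unified)
```

Keeping the Kurdish IDs fixed means a Kurdish checkpoint's embedding and output layers can be copied into the first rows of the larger model, with only the new rows initialized from scratch.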

Dataset

The models were trained using the following subsets of the DASTNUS Kurdish handwritten dataset:

| Data Source | Training | Validation | Testing |
| --- | --- | --- | --- |
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| Total | 7,849 | 655 | 649 |

The data used in this research is available upon request for non-commercial scientific research purposes only.

Citation

[]

License

This repository is released for non-commercial scientific research purposes only.
