ModernUzBERT: A State-of-the-Art Semantic Representation Model for Uzbek

ModernUzBERT is an embedding model built for the Uzbek language. Based on the ModernBERT architecture, it was trained on a large Uzbek corpus and fine-tuned on 29,000+ domain-specific pairs for semantic retrieval and document representation.

📊 Dataset and Corpus Details

The model is built on a large, diverse Uzbek corpus of roughly 125 million words. Corpus statistics and the fine-tuning setup are summarized below.

Metric	Value / Detail
Training Samples	30,000 (CQA Pairs)
Total Word Count	125,261,608
Vocabulary Size	1,348,641 unique words
Training Loss	MultipleNegativesRankingLoss
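
The training loss listed above, MultipleNegativesRankingLoss, treats every other positive in a batch as a negative for a given anchor. The following is a minimal pure-Python sketch of that idea, not the training code used for this model. The function names `cosine` and `mnr_loss` are illustrative, and the scale of 20 is the sentence-transformers default, assumed here rather than confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives loss: positives[i] is the match for anchors[i];
    every other positives[j] acts as a negative for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" (illustrative values only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(mnr_loss(anchors, aligned))   # near zero: each anchor matches its positive
print(mnr_loss(anchors, shuffled))  # large: each anchor matches the wrong positive
```

The loss is small only when each anchor is closer to its own positive than to the other positives in the batch, which is why larger batches tend to provide a harder training signal.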

📈 Evaluation Benchmarks (Scientific Results)

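ModernUzBERT outperforms the local baseline UzRoBERTa on every retrieval metric reported below. Against the multilingual BGE-M3 it leads on Recall@3, Recall@5, and Recall@10, ties on MRR@10 (0.74), and trails on Recall@1 (0.62 vs. 0.66). It also has the lowest average inference latency of the three models.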

1. Retrieval Accuracy (Recall & MRR)

Metric	ModernUzBERT	UzRoBERTa	BGE-M3
Recall@1	0.62	0.56	0.66
Recall@3	0.84	0.73	0.80
Recall@5	0.88	0.79	0.83
Recall@10	0.94	0.86	0.87
MRR@10	0.74	0.66	0.74
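
For reference, Recall@k and MRR@k can be computed from the rank of the correct passage for each query. The sketch below assumes exactly one relevant passage per query (plausible for question-answer pairs, but the evaluation protocol is not stated here). The function names and the sample ranks are illustrative only and are not the data behind the table.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct passage appears in the top k.
    `ranks` holds the 1-based rank per query, or None if it was not retrieved."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    total = sum(1.0 / r for r in ranks if r is not None and r <= k)
    return total / len(ranks)


# Hypothetical ranks for four queries (None means the passage was missed).
sample_ranks = [1, 3, None, 2]
print(recall_at_k(sample_ranks, 1))
print(recall_at_k(sample_ranks, 3))
print(mrr_at_k(sample_ranks, 10))
```

Recall@k only records whether the correct passage made the cut, while MRR@k also rewards placing it higher, which is how two models can tie on MRR@10 yet differ on Recall@1.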

2. Efficiency Analysis (Inference Speed)

Model	Avg. Latency (s)	Efficiency Rank
ModernUzBERT	0.0051	1st (Fastest)
UzRoBERTa	0.0059	2nd
BGE-M3	0.0135	3rd
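
Average latency of this kind is typically measured by timing repeated calls after a short warm-up. The helper below is a hypothetical sketch: the hardware, batch size, and exact procedure behind the table are not stated, so `average_latency` and its `warmup` parameter are assumptions.

```python
import time


def average_latency(encode_fn, queries, warmup=3):
    """Return mean seconds per call of encode_fn over `queries`.
    A few untimed warm-up calls reduce one-off start-up effects."""
    for query in queries[:warmup]:
        encode_fn(query)
    start = time.perf_counter()
    for query in queries:
        encode_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / len(queries)


# Stand-in workload; with a real model, pass its encode method instead.
demo_latency = average_latency(lambda text: text.lower(), ["Salom", "Dunyo", "Kitob"])
print(f"{demo_latency:.6f} s per query")
```

Taking the table at face value, 0.0135 / 0.0051 puts ModernUzBERT at roughly 2.6 times faster than BGE-M3 and 0.0059 / 0.0051 at about 1.16 times faster than UzRoBERTa.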

🛠 Usage

Installation

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Orzumurod/ModernUzBERT")

# Example Uzbek sentences
sentences = [
    "Yuridik shaxsning taʼsis hujjatlarida qanday maʼlumotlar aks ettirilishi kerak?",
    "Taʼsischisi, yuridik shaxsning pochta manzili va ustav fondi miqdori hujjatlarda aks etishi lozim."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Compute similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity Score: {similarity.item():.2f}")
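
For retrieval, the same embeddings can be used to rank several candidate passages against one query. The sketch below is pure Python and works on any list of vectors. `rank_passages` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode`.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_index, score) pairs for the top_k most similar passages."""
    scored = [(cosine(query_vec, p), idx) for idx, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy vectors (illustrative only); in practice these come from model.encode.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
for idx, score in rank_passages(query, passages, top_k=2):
    print(idx, round(score, 3))
```

For large collections, an approximate nearest-neighbour index would normally replace this exhaustive scan.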

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
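
The Pooling layer above uses mean pooling: token vectors are averaged into one sentence vector, skipping padding positions. A minimal sketch under that reading follows. `mean_pool` is an illustrative name, and the real layer operates on batched tensors rather than Python lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                totals[j] += value
    kept = max(kept, 1)  # guard against an all-padding input
    return [t / kept for t in totals]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

With the configuration shown, each sentence vector has 768 dimensions, matching `word_embedding_dimension`.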

Citation

@misc{modernuzbert2026,
  author = {Orzumurod},
  title = {ModernUzBERT: Advanced Semantic Embeddings for the Uzbek Language},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Orzumurod/ModernUzBERT}}
}

Model size: 0.2B params (Safetensors)
Tensor type: F32