ModernUzBERT: A State-of-the-Art Semantic Representation Model for Uzbek
ModernUzBERT is a sentence-embedding model built and optimized for the Uzbek language. Based on the ModernBERT architecture, it was trained on a large Uzbek corpus and fine-tuned on 30,000 domain-specific CQA pairs for semantic retrieval and document representation.
📊 Dataset and Corpus Details
The model's linguistic foundation is an Uzbek corpus of roughly 125 million words and 1.35 million unique word forms.
| Metric | Value / Detail |
|---|---|
| Training Samples | 30,000 (CQA Pairs) |
| Total Word Count | 125,261,608 |
| Vocabulary Size | 1,348,641 unique words |
| Training Loss | MultipleNegativesRankingLoss |
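The training loss listed above treats each pair in a batch as a positive and every other answer in the same batch as a negative. The following is a minimal pure-Python sketch of that idea, not the library's implementation. It assumes cosine similarity and a scale factor of 20, which are the sentence-transformers defaults as far as I know; the card does not state the hyperparameters actually used, and the toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch with in-batch negatives.

    positives[i] is the correct match for anchors[i]; every other
    entry of `positives` serves as a negative for that anchor.
    `scale` is an assumed value, not taken from the model card.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of the correct positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" (hypothetical data).
questions = [[1.0, 0.0], [0.0, 1.0]]
answers_aligned = [[0.9, 0.1], [0.1, 0.9]]
answers_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnr_loss(questions, answers_aligned)
loss_swapped = mnr_loss(questions, answers_swapped)
print(loss_aligned < loss_swapped)
```

The loss is near zero when each question is closest to its own answer and grows when another answer in the batch is closer, which is why larger batches give the model more negatives to separate.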
📈 Evaluation Benchmarks (Scientific Results)
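On this benchmark, ModernUzBERT outperforms the local baseline UzRoBERTa on every metric and outperforms the multilingual BGE-M3 on Recall@3, Recall@5, and Recall@10. BGE-M3 leads on Recall@1 (0.66 vs. 0.62), and the two models tie on MRR@10 (0.74).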
1. Retrieval Accuracy (Recall & MRR)
| Metric | ModernUzBERT | UzRoBERTa | BGE-M3 |
|---|---|---|---|
| Recall@1 | 0.62 | 0.56 | 0.66 |
| Recall@3 | 0.84 | 0.73 | 0.80 |
| Recall@5 | 0.88 | 0.79 | 0.83 |
| Recall@10 | 0.94 | 0.86 | 0.87 |
| MRR@10 | 0.74 | 0.66 | 0.74 |
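For readers unfamiliar with the metrics in the table, here is a small sketch of how Recall@k and MRR@k are commonly computed. It assumes exactly one relevant document per query, which the card does not confirm; the document IDs and rankings are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_id, k):
    """1/rank of the relevant document within the top k, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(rankings, gold, k):
    """Average both metrics over all queries."""
    n = len(rankings)
    recall = sum(recall_at_k(r, g, k) for r, g in zip(rankings, gold)) / n
    mrr = sum(reciprocal_rank_at_k(r, g, k) for r, g in zip(rankings, gold)) / n
    return recall, mrr


# Hypothetical retrieval results for three queries.
rankings = [
    ["d1", "d2", "d3"],  # relevant document at rank 1
    ["d5", "d4", "d6"],  # relevant document at rank 2
    ["d9", "d8", "d7"],  # relevant document not retrieved
]
gold = ["d1", "d4", "d0"]

recall_3, mrr_3 = evaluate(rankings, gold, k=3)
print(f"Recall@3 = {recall_3:.2f}, MRR@3 = {mrr_3:.2f}")
```

Recall@k only asks whether the right document is somewhere in the top k, while MRR@k also rewards placing it higher, which is how two models can tie on MRR@10 yet differ on Recall@1.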
2. Efficiency Analysis (Inference Speed)
| Model | Avg. Latency (s) | Efficiency Rank |
|---|---|---|
| ModernUzBERT | 0.0051 | 1st (Fastest) |
| UzRoBERTa | 0.0059 | 2nd |
| BGE-M3 | 0.0135 | 3rd |
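The card does not describe the hardware, batch size, or input length behind these latency figures. As a rough illustration of how an average per-sentence latency could be measured, the sketch below times repeated calls to an encoder function. `average_latency` and `fake_encode` are hypothetical names introduced here; `fake_encode` is a stand-in so the snippet runs without downloading a model.

```python
import time


def average_latency(encode_fn, texts, warmup=2, runs=10):
    """Average wall-clock seconds per encoded text.

    Warm-up calls are excluded so one-time setup costs
    (lazy initialization, caches) do not skew the average.
    """
    for _ in range(warmup):
        encode_fn(texts[0])
    start = time.perf_counter()
    for _ in range(runs):
        for text in texts:
            encode_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(texts))


def fake_encode(text):
    """Hypothetical stand-in for a real encoder such as model.encode."""
    return [float(len(text))]


latency = average_latency(fake_encode, ["example sentence one", "example sentence two"])
print(f"{latency:.6f} s per sentence")
```

To time a real model, pass its `encode` method as `encode_fn`. Results are only comparable across models when measured on the same device with the same inputs.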
🛠 Usage
Installation
```python
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Orzumurod/ModernUzBERT")

# Example Uzbek sentences
sentences = [
    "Yuridik shaxsning taʼsis hujjatlarida qanday maʼlumotlar aks ettirilishi kerak?",
    "Taʼsischisi, yuridik shaxsning pochta manzili va ustav fondi miqdori hujjatlarda aks etishi lozim.",
]

# Generate embeddings
embeddings = model.encode(sentences)

# Compute similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity Score: {similarity.item():.2f}")
```
Model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
```
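The `Pooling` module above uses mean pooling (`pooling_mode_mean_tokens: True`): the sentence embedding is the average of the token embeddings, with padding positions excluded via the attention mask. The following pure-Python sketch illustrates the idea on toy 2-dimensional vectors (the real model uses 768 dimensions); `mean_pool` is an illustrative helper, not the library's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    # Guard against an all-zero mask to avoid division by zero.
    return [value / max(count, 1) for value in total]


# Two real tokens followed by one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the padding vector is masked out, it has no effect on the result regardless of its values.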
@misc{modernuzbert2026,
author = {Orzumurod},
title = {ModernUzBERT: Advanced Semantic Embeddings for the Uzbek Language},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Orzumurod/ModernUzBERT}}
}