Embedding-Amharic-Base

This is a sentence-transformers model fine-tuned from rasyosef/roberta-base-amharic. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

It was introduced in the paper The Multilingual Curse at the Retrieval Layer: Evidence from Amharic.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: rasyosef/roberta-base-amharic
  • Maximum Sequence Length: 510 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: am
  • License: mit

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
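
The Pooling and Normalize modules listed above amount to a masked mean over token embeddings followed by L2 normalization. The following is a minimal NumPy sketch of that computation, not the library's own code. The function name and the random toy inputs are illustrative assumptions that stand in for the transformer's token outputs.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean over tokens, then L2-normalize each sentence vector.

    Hypothetical helper for illustration; shapes are
    token_embeddings [batch, seq, dim] and attention_mask [batch, seq].
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # [batch, seq, 1]
    summed = (token_embeddings * mask).sum(axis=1)                   # [batch, dim]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts                                         # mean over real tokens only
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)                      # unit-length vectors

# Toy input: 2 sentences, 4 token positions, 768 dimensions.
# The second sentence has two padding positions (mask = 0).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 768))
attention_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

sentence_embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Because the output vectors have unit length, the dot product of two embeddings equals their cosine similarity.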

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rasyosef/embedding-amharic-base")

# What is the capital of Ethiopia? / France
queries = ['የኢትዮጵያ ዋና ከተማ ማናት?', 'የፈረንሳይ ዋና ከተማ ማናት?'] 

# Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ'] 

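# Compute embeddings
# Note: encode_query / encode_document are available in Sentence Transformers v5.0 and later;
# with older versions, model.encode(...) can be used instead.
query_embeddings = model.encode_query(queries) # [2, 768]
document_embeddings = model.encode_document(documents) # [5, 768]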

# Calculate semantic similarity
similarities = model.similarity(
    query_embeddings, 
    document_embeddings
)

print(similarities)
# tensor([[0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
#         [0.1777, 0.0770, 0.5714, 0.2596, 0.1076]])
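
To turn the similarity matrix into a retrieval result, each row can be sorted in descending order. The sketch below reuses the scores printed above as plain Python lists so it runs without downloading the model; the helper name top_k is an illustrative assumption.

```python
# Documents and scores copied from the example output above.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']
similarities = [
    [0.5075, 0.3114, 0.0798, 0.1967, 0.1340],  # query 1: capital of Ethiopia
    [0.1777, 0.0770, 0.5714, 0.2596, 0.1076],  # query 2: capital of France
]

def top_k(scores, k):
    """Return the indices of the k highest scores, best first (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

top_indices = [top_k(row, 2) for row in similarities]
print(top_indices)  # [[0, 1], [2, 3]]

# Map indices back to document text.
rankings = [[documents[i] for i in row] for row in top_indices]
print(rankings)  # Addis Ababa and Gondar for query 1; Paris and London for query 2
```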

Evaluation

Metrics

Information Retrieval (dim_768)

Metric            Value
cosine_recall@5   0.8698
cosine_recall@10  0.9051
cosine_ndcg@10    0.8037
cosine_mrr@10     0.7708

Information Retrieval (dim_256)

Metric            Value
cosine_recall@5   0.8647
cosine_recall@10  0.902
cosine_ndcg@10    0.7978
cosine_mrr@10     0.764
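
For readers unfamiliar with these metrics, here is a minimal sketch of how recall@k, MRR@k, and nDCG@k can be computed for a single query with binary relevance. The document IDs and function names are made up for illustration. The reported numbers would be averages over all evaluation queries, and the exact evaluator used for this model is not shown here.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical ranking for one query: the single relevant document is at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1"}

print(recall_at_k(ranked, relevant, 10))          # 1.0
print(mrr_at_k(ranked, relevant, 10))             # 0.5
print(round(ndcg_at_k(ranked, relevant, 10), 4))  # 0.6309
```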

Training Details

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 2
  • learning_rate: 6e-05
  • num_train_epochs: 6
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.025
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
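
To illustrate how learning_rate, warmup_ratio, and lr_scheduler_type interact, the sketch below computes a linear-warmup, cosine-decay learning rate in plain Python. It rests on several assumptions: that the 1921 steps per epoch in the training log are optimizer steps, that the schedule is planned over all 6 epochs, and that the warmup length is the total step count times warmup_ratio, rounded up. The function lr_at is a hypothetical helper, not a library call.

```python
import math

BASE_LR = 6e-05
WARMUP_RATIO = 0.025
STEPS_PER_EPOCH = 1921   # from the training log; assumed to be optimizer steps
NUM_EPOCHS = 6           # num_train_epochs
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)  # assumed rounding

# Per-device batch size times gradient accumulation steps.
EFFECTIVE_BATCH_PER_DEVICE = 64 * 2

def lr_at(step):
    """Learning rate at a given optimizer step (linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(TOTAL_STEPS, WARMUP_STEPS, EFFECTIVE_BATCH_PER_DEVICE)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

Under these assumptions the rate climbs from zero to 6e-05 during the first few hundred steps and then decays smoothly toward zero by the final step.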

Training Logs

Epoch  Step  Training Loss  dim_768_cosine_ndcg@10  dim_256_cosine_ndcg@10
-1     -1    -              0.0735                  0.0582
1.0    1921  0.6769         0.7826                  0.7751
2.0    3842  0.07           0.7894                  0.7829
3.0    5763  0.0254         0.8030                  0.7953
4.0    7684  0.0139         0.8037                  0.7978
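
The dim_256 column suggests the embeddings were also evaluated at a reduced size. Assuming this means keeping the first 256 dimensions and re-normalizing (the usual Matryoshka-style truncation, which this card does not explicitly confirm), a minimal NumPy sketch looks like the following. Random vectors stand in for real model outputs, and the function name is illustrative.

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim):
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for three normalized 768-dimensional embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
cosine = small @ small.T  # cosine similarity, since rows are unit length

print(small.shape)   # (3, 256)
print(cosine.shape)  # (3, 3)
```

Sentence Transformers also accepts a truncate_dim argument when loading a model, which applies the truncation at encode time. Whether 256 dimensions is a supported operating point for this model is inferred from the log column name only.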

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.1+cu126
  • Accelerate: 1.7.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}