RoBERTa Amharic Text Embedding Base

This is a sentence-transformers model finetuned from yosefw/roberta-base-am-embed on the json dataset. It maps Amharic sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: yosefw/roberta-base-am-embed
  • Maximum Sequence Length: 510 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: am
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
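
The stack above is an XLM-RoBERTa encoder, mean pooling over token embeddings, and L2 normalization. For reference, here is a minimal sketch of the same computation using plain transformers instead of sentence-transformers, assuming the tokenizer and weights load directly from the Hub repo (the example sentence is a placeholder):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-amharic-text-embedding-base")
model = AutoModel.from_pretrained("rasyosef/roberta-amharic-text-embedding-base")

batch = tokenizer(["ሰላም ዓለም"], padding=True, truncation=True, max_length=510, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 768)

# (1) Pooling: mean over non-padding tokens (pooling_mode_mean_tokens)
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# (2) Normalize: unit-length vectors, so dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)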

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("rasyosef/roberta-amharic-text-embedding-base")
# Run inference
sentences = [
  "የተደጋገመው የመሬት መንቀጥቀጥና የእሳተ ገሞራ ምልክት በአፋር ክልል",
  "ከተደጋጋሚ መሬት መንቀጥቀጥ በኋላ አፋር ክልል እሳት ከመሬት ውስጥ ሲፈላ ታይቷል፡፡ ከመሬት ውስጥ እሳትና ጭስ የሚተፋው እንፋሎቱ ዛሬ ማለዳውን 11 ሰዓት ግድም ከከባድ ፍንዳታ በኋላየተስተዋለ መሆኑን የአከባቢው ነዋሪዎች እና ባለስልጣናት ለዶቼ ቬለ ተናግረዋል፡፡ አለት የሚያፈናጥር እሳት ነው የተባለው እንፋሎቱ በክልሉ ጋቢረሱ (ዞን 03) ዱለቻ ወረዳ ሰጋንቶ ቀበሌ መከሰቱን የገለጹት የአከባቢው የአይን እማኞች ከዋናው ፍንዳታ በተጨማሪ በዙሪያው ተጨማሪ ፍንዳታዎች መታየት ቀጥሏል ባይ ናቸው፡፡",
  "ለኢትዮጵያ ብሔራዊ ባንክ ዋጋን የማረጋጋት ቀዳሚ ዓላማ ጋር የተጣጣሙ የገንዘብ ፖሊሲ ምክረ ሀሳቦችን እንዲሰጥ የተቋቋመው የኢትዮጵያ ብሔራዊ ባንክ የገንዘብ ፖሊሲ ኮሚቴ እስካለፈው ህዳር ወር የነበረው እአአ የ2024 የዋጋ ግሽበት በተለይምምግብ ነክ ምርቶች ላይ ከአንድ ዓመት በፊት ከነበው ጋር ሲነጻጸር መረጋጋት ማሳየቱን ጠቁሟል፡፡ ዶይቼ ቬለ ያነጋገራቸው የአዲስ አበባ ነዋሪዎች ግን በዚህ የሚስማሙ አይመስልም፡፡ ከአምና አንጻር ያልጨመረ ነገር የለም ባይ ናቸው፡፡ የኢኮኖሚ  ባለሙያም በሰጡን አስተያየት ጭማሪው በሁሉም ረገድ የተስተዋለ በመሆኑ የመንግስት ወጪን በመቀነስ ግብርናው ላይ አተኩሮ መስራት ምናልባትም የዋጋ መረጋጋቱን ሊያመጣ ይችላል ይላሉ፡፡"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
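
The same model supports semantic search by ranking document embeddings against a query embedding. A minimal sketch reusing the model and sentences loaded above, treating the first sentence as the query:

query_embedding = model.encode([sentences[0]])  # the headline as a query
doc_embeddings = model.encode(sentences[1:])    # the two articles as documents

# Cosine similarity scores between the query and each document, shape (1, 2)
scores = model.similarity(query_embedding, doc_embeddings)
best = scores.argmax().item()
print(best, float(scores[0, best]))  # the earthquake article should rank first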

Evaluation

Metrics

Information Retrieval

Metric                 dim_768   dim_512   dim_384   dim_256   dim_128   dim_64
cosine_accuracy@1      0.6833    0.6792    0.6744    0.675     0.6586    0.6336
cosine_accuracy@3      0.8075    0.8104    0.8078    0.804     0.7937    0.7777
cosine_accuracy@5      0.8511    0.8521    0.8524    0.8479    0.8383    0.8232
cosine_accuracy@10     0.8967    0.8938    0.8916    0.8896    0.8774    0.8669
cosine_precision@1     0.6833    0.6792    0.6744    0.675     0.6586    0.6336
cosine_precision@3     0.2692    0.2701    0.2693    0.268     0.2646    0.2592
cosine_precision@5     0.1702    0.1704    0.1705    0.1696    0.1677    0.1646
cosine_precision@10    0.0897    0.0894    0.0892    0.089     0.0877    0.0867
cosine_recall@1        0.6833    0.6792    0.6744    0.675     0.6586    0.6336
cosine_recall@3        0.8075    0.8104    0.8078    0.804     0.7937    0.7777
cosine_recall@5        0.8511    0.8521    0.8524    0.8479    0.8383    0.8232
cosine_recall@10       0.8967    0.8938    0.8916    0.8896    0.8774    0.8669
cosine_ndcg@10         0.7899    0.7881    0.7846    0.7833    0.7695    0.7515
cosine_mrr@10          0.7558    0.7541    0.7501    0.7491    0.7347    0.7144
cosine_map@100         0.7589    0.7574    0.7536    0.7526    0.7387    0.7186
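
Because the model was trained with MatryoshkaLoss (see Training Details below), the smaller dimensions in this table can be used directly by truncating the embeddings, trading a little accuracy for faster search and less storage. A minimal sketch using the truncate_dim argument of sentence-transformers (the example sentence is a placeholder):

from sentence_transformers import SentenceTransformer

# encode() now returns 256-dimensional embeddings (the first 256 Matryoshka dims)
model = SentenceTransformer("rasyosef/roberta-amharic-text-embedding-base", truncate_dim=256)

embeddings = model.encode(["ሰላም ዓለም"])
print(embeddings.shape)  # (1, 256)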

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 28,046 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string, min 4 / mean 14.56 / max 47 tokens
    • positive: string, min 42 / mean 204.24 / max 510 tokens
  • Samples:
    • anchor: የዱር እንስሳት ከሰዎች ጋር በሚኖራቸው ቁርኝት ለኮሮናቫይረስ ተጋላጭ እንዳይሆኑ የመከላከል ተግባራትን እያከናወኑ መሆኑን ባለስልጣኑ አስታወቀ፡፡
      positive: ባሕርዳር፡ ግንቦት 18/2012 ዓ.ም (አብመድ) የአማራ ክልል የአካባቢ፣ የደንና የዱር እንስሳት ጥበቃና ልማት ባለስልጣን በሚያስተዳድራቸው ብሔራዊ ፓርኮች እና የማኅበረሰብ ጥብቅ ሥፍራዎች ከኮሮናቫይረስ ተጋላጭነት ለመከላከል እየሠራ መሆኑን አስታውቋል፡፡የባለስልጣኑ የኮሙዩኒኬሽን ዳይሬክተር ጋሻው እሸቱ 10 በሚሆኑ ብሔራዊ ፓርኮችና የማኅበረሰብ ጥብቅ ሥፍራዎች የኮሮና ቫይረስን መከላከል በሚቻልባቸው ቅድመ ተግባራት እና ርምጃዎች ላይ መምከራቸውን ተናግረዋል፡፡ የዱር እንስሳት በመንጋ የሚኖሩ፣ እርስ በርሳቸው ተመጋጋቢ፣ ከሰዎች እና ከቤት እንስሳቶች ጋር ሊቀላቀሉ የሚችሉ በመሆናቸው በኮሮናቫይረስ ከተጋለጡ ‘‘የኮሮናቫይረስ ተጋላጭነት በብርቅየ የዱር እንስሳት ብዝኃ ሕይወት ላይ ስጋት መሆን የለበትም’’ ያሉት አቶ ጋሻው በፓርኮቹ ውስጥ ለሚሠሩ የጥበቃ፣ ስካውት እና ለጽሕፈት ቤት ሠራተኞች በዘርፉ ላይ ያተኮረ የኮሮናቫይረስ መከላከያ ትምህርቶችን እና የቁሳቁስ ድጋፎችን ማድረጋቸውን አስታውቀዋል፡፡
    • anchor: የትግራይ ክልል የአየር መሥመር ለአገልግሎት ክፍት ሆነ፡፡
      positive: የትግራይ ክልል የአየር መሥመር ለአገልግሎት ክፍት ሆነ፡፡ ባሕር ዳር፡ ታኅሣሥ 05/2013 ዓ.ም (አብመድ) በሰሜን ኢትዮጵያ ትግራይ ክልል የህግ ማስከበር ሂደትን ተከትሎ ተዘግቶ የነበረው የአየር ክልል ከዛሬ ታህሣሥ 5/2013 ዓ.ም ከቀኑ 8 ሰዓት ጀምሮ በሰሜን የኢትዮጵያ የአየር ክልል ውስጥ የሚያቋርጡ የአለም አቀፍ እና የሃገር ውስጥ የበረራ መስመሮች ለአገልግሎት ክፍት ሆነዋል፡፡ አገልግሎት መሥጠት የሚችሉ ኤርፖርቶች በረራ ማስተናገድ የሚችሉ መሆኑንም የኢትዮጵያ ሲቪል አቪዬሽን ባለስልጣን ገልጿል::
    • anchor: የአውሮፓ ኢንቨስትመንት ባንክ ለመንግሥት 76 ሚሊዮን ዶላር ሊያበድር ነው
      positive: በዳዊት እንደሻው የአውሮፓ ኢንቨስትመንት ባንክ ጽሕፈት ቤቱን በአዲስ አበባ ከከፈተ ከሁለት ዓመት በኋላ ትልቅ ነው የተባለለትን የ76 ሚሊዮን ዶላር ብድር ስምምነት ለመፈራረም፣ ኃላፊዎቹን ወደ ኢትዮጵያ ይልካል፡፡ከወር በፊት በኢትዮጵያ መንግሥትና በባንኩ መካከል የተደረገው ይኼ የብድር ስምምነት፣ የኢትዮጵያ ልማት ባንክ በሊዝ ፋይናንሲንግ ለአነስተኛና ለመካከለኛ ኢንተርፕራይዞች ለሚያደርገው እገዛ ይውላል፡፡የአውሮፓ ኢንቨስትመንት ባንክ ምክትል ፕሬዚዳንት ፒም ቫን በሌኮም፣ እንዲሁም ሌሎች ኃላፊዎች ይመጣሉ ተብሎ ይጠበቃል፡፡በዚህም መሠረት የባንኩ ኃላፊዎች ከገንዘብና ኢኮኖሚ ትብብር ሚኒስቴር ጋር አድርገውት ከነበረው ስምምነት የሚቀጥልና ተመሳሳይ የሆነ ስምምነት፣ ከኢትዮጵያ ልማት ባንክ ጋር እንደሚያደርጉ ይጠበቃል፡፡እ.ኤ.አ. እስከ 2022 ድረስ የሚቀጥለው አነስተኛና መካከለኛ ኢንተርፕራይዞችን የማገዝ ፕሮጀክት 276 ሚሊዮን ዶላር ወጪ የሚያስወጣ ሲሆን፣ ባለፈው ዓመት የዓለም ባንክ ወደ 200 ሚሊዮን ዶላር ብድር ሰጥቷል፡፡በአውሮፓ ኢንቨስትመንት ባንክ የሚሰጠው ብድር፣ የኢትዮጵያ ልማት ባንክን የሊዝ ፋይናንሲንግ ሥራ እንደሚያግዝ ጉዳዩ የሚመለከታቸው የልማት ባንክ ኃላፊዎች ለሪፖርተር ተናግረዋል፡፡ ‹‹በተጨማሪም የውጭ ምንዛሪ እጥረቱን ለማቃለል ያግዛል፤›› ሲሉ ኃላፊው ገልጸዋል፡፡በልማት ባንክ በኩል የሚደረገው እገዛ በሁለት መስኮቶች የሚወጣ ሲሆን፣ አንደኛው በቀጥታ በባንክ እንደ ሊዝ ፋይናንሲንግ ሲሰጥ ሌላው ደግሞ እንደ መሥሪያ ካፒታል ልማት ባንክ ለመረጣቸው 12 ባንኮችና ዘጠኝ ማይክሮ ፋይናንሶች ይሰጣል፡፡የአውሮፓ ኢንቨስትመንት ባንክ በኢትዮጵያ መንቀሳቀስ ከጀመረ ከ1980ዎቹ ጀምሮ ወደ ግማሽ ቢሊዮን ዶላር የሚጠጋ ለኃይል፣ ለኮሙዩኒኬሽንና ለግሉ ዘርፍ ኢ...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            384,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    
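
In sentence-transformers, this loss configuration corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss. A minimal sketch of how it would be constructed, starting from the base model named above:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("yosefw/roberta-base-am-embed")

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),       # in-batch negatives ranking loss
    matryoshka_dims=[768, 512, 384, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1, 1],     # equal weight for every dimension
)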

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
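
These map onto SentenceTransformerTrainingArguments roughly as follows. A minimal sketch; the output_dir path and save_strategy are assumptions, not listed in the card:

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/roberta-amharic-text-embedding-base",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed: must match eval_strategy when load_best_model_at_end=True
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
)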

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_384_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.0455 10 19.6872 - - - - - -
0.0909 20 11.0221 - - - - - -
0.1364 30 4.1418 - - - - - -
0.1818 40 2.6854 - - - - - -
0.2273 50 2.1661 - - - - - -
0.2727 60 1.7602 - - - - - -
0.3182 70 1.6862 - - - - - -
0.3636 80 1.484 - - - - - -
0.4091 90 1.2841 - - - - - -
0.4545 100 1.3569 - - - - - -
0.5 110 1.3734 - - - - - -
0.5455 120 1.3205 - - - - - -
0.5909 130 1.1156 - - - - - -
0.6364 140 1.0249 - - - - - -
0.6818 150 1.0461 - - - - - -
0.7273 160 1.0729 - - - - - -
0.7727 170 0.9913 - - - - - -
0.8182 180 1.027 - - - - - -
0.8636 190 1.0165 - - - - - -
0.9091 200 0.9928 - - - - - -
0.9545 210 0.971 - - - - - -
1.0 220 0.9636 0.7656 0.7614 0.7588 0.7534 0.7385 0.7081
1.0455 230 1.0605 - - - - - -
1.0909 240 0.9032 - - - - - -
1.1364 250 0.7504 - - - - - -
1.1818 260 0.7361 - - - - - -
1.2273 270 0.4918 - - - - - -
1.2727 280 0.3651 - - - - - -
1.3182 290 0.3963 - - - - - -
1.3636 300 0.4032 - - - - - -
1.4091 310 0.2712 - - - - - -
1.4545 320 0.26 - - - - - -
1.5 330 0.3159 - - - - - -
1.5455 340 0.2913 - - - - - -
1.5909 350 0.2569 - - - - - -
1.6364 360 0.1793 - - - - - -
1.6818 370 0.2063 - - - - - -
1.7273 380 0.2065 - - - - - -
1.7727 390 0.1945 - - - - - -
1.8182 400 0.2352 - - - - - -
1.8636 410 0.2077 - - - - - -
1.9091 420 0.2017 - - - - - -
1.9545 430 0.1806 - - - - - -
2.0 440 0.2214 0.7773 0.7754 0.7738 0.7670 0.7552 0.7332
2.0455 450 0.2133 - - - - - -
2.0909 460 0.2202 - - - - - -
2.1364 470 0.1333 - - - - - -
2.1818 480 0.1789 - - - - - -
2.2273 490 0.1025 - - - - - -
2.2727 500 0.0897 - - - - - -
2.3182 510 0.1128 - - - - - -
2.3636 520 0.1218 - - - - - -
2.4091 530 0.0747 - - - - - -
2.4545 540 0.0596 - - - - - -
2.5 550 0.0942 - - - - - -
2.5455 560 0.1011 - - - - - -
2.5909 570 0.0606 - - - - - -
2.6364 580 0.0483 - - - - - -
2.6818 590 0.057 - - - - - -
2.7273 600 0.0504 - - - - - -
2.7727 610 0.0497 - - - - - -
2.8182 620 0.0585 - - - - - -
2.8636 630 0.0791 - - - - - -
2.9091 640 0.0556 - - - - - -
2.9545 650 0.0555 - - - - - -
3.0 660 0.0598 0.7857 0.7821 0.7811 0.7766 0.7650 0.7443
3.0455 670 0.081 - - - - - -
3.0909 680 0.065 - - - - - -
3.1364 690 0.0566 - - - - - -
3.1818 700 0.0758 - - - - - -
3.2273 710 0.0378 - - - - - -
3.2727 720 0.04 - - - - - -
3.3182 730 0.0465 - - - - - -
3.3636 740 0.0426 - - - - - -
3.4091 750 0.0348 - - - - - -
3.4545 760 0.0254 - - - - - -
3.5 770 0.044 - - - - - -
3.5455 780 0.0455 - - - - - -
3.5909 790 0.0274 - - - - - -
3.6364 800 0.0212 - - - - - -
3.6818 810 0.0279 - - - - - -
3.7273 820 0.0269 - - - - - -
3.7727 830 0.0243 - - - - - -
3.8182 840 0.03 - - - - - -
3.8636 850 0.0359 - - - - - -
3.9091 860 0.0308 - - - - - -
3.9545 870 0.0253 - - - - - -
4.0 880 0.0417 0.7899 0.7881 0.7846 0.7833 0.7695 0.7515
  • The row for epoch 4.0 (step 880) denotes the saved checkpoint; its metrics match the Evaluation section above.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}