SkillRet-Embedding-0.6B

This is a sentence-transformers model fine-tuned for AI agent skill retrieval. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.

The model is fine-tuned from Qwen/Qwen3-Embedding-0.6B on the SkillRet benchmark training split using contrastive learning (MultipleNegativesRankingLoss).

📄 Technical report: SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents (arXiv:2605.05726)

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the fine-tuned embedding model from the Hugging Face Hub.
model = SentenceTransformer("ThakiCloud/SkillRet-Embedding-0.6B", trust_remote_code=True)

# Instruction prefix prepended to every query; skill texts are encoded without it.
query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "

queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

# L2-normalize the embeddings so the dot product below equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

# similarities[i, j] is the cosine similarity between query i and skill j.
similarities = q_emb @ s_emb.T
print(similarities)
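
To turn a row of similarity scores into a ranked shortlist, a small helper along the following lines may be enough. This is a minimal sketch, not part of the released code: `rank_skills` and `skill_name` are hypothetical helper names, the "name | description" layout is inferred from the example skills above, and the scores are toy values rather than model output. With the real model, one row of the matrix could be passed in as `similarities[0].tolist()`.

```python
def rank_skills(scores, skills, k=5):
    """Return the k highest-scoring (skill, score) pairs, best first."""
    order = sorted(range(len(skills)), key=lambda i: scores[i], reverse=True)
    return [(skills[i], scores[i]) for i in order[:k]]


def skill_name(skill_text):
    """Extract the name from a 'name | description' skill string (assumed layout)."""
    return skill_text.split("|", 1)[0].strip()


skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

# Toy scores for illustration only; these are NOT outputs of the model.
toy_scores = [0.71, 0.18]

top = rank_skills(toy_scores, skills, k=1)
print(skill_name(top[0][0]))
```

For a library of thousands of skills, the skill embeddings would normally be computed once and cached, so that only the query needs to be encoded at request time.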

Training Details

  • Base model: Qwen3-Embedding-0.6B (0.6B parameters)
  • Training data: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
  • Loss: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
  • Hardware: 4× NVIDIA B200 GPUs (DDP)
  • Effective batch size: 384 (96 per device × 4 GPUs)
  • Max sequence length: 8,192 tokens
  • Learning rate: 2e-5
  • Epochs: 1
  • Training time: ~6 hours
  • Precision: BF16
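
The in-batch contrastive objective named above can be illustrated in plain Python. This is a sketch under stated assumptions, not the training code: it assumes cosine similarity and a scale of 20.0 (the sentence-transformers defaults for MultipleNegativesRankingLoss; the values actually used in training are not stated here), and `in_batch_infonce` is a hypothetical name. Skill i is treated as the positive for query i, and every other skill in the batch serves as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_infonce(query_embs, skill_embs, scale=20.0):
    """Mean cross-entropy over the batch with in-batch negatives.

    scale=20.0 is an assumption (library default), not a documented setting.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, s) for s in skill_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching skill (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Perfectly aligned pairs give a loss near 0; swapped pairs give a large loss.
aligned = in_batch_infonce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = in_batch_infonce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Cross-GPU negative sharing enlarges the candidate set for each query from the per-device batch to the effective batch. As a point of reference, if each query is contrasted against all 384 candidates, scoring every candidate equally would give a loss of ln 384 ≈ 5.95.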

Training Logs

Epoch  Step  Training Loss  NDCG@15
0.15     50         2.4288   0.7802
0.30    100         1.9920   0.7842
0.45    150         1.9758   0.7887
0.60    200         1.9011   0.7865
0.76    250         1.9100   0.7874
0.91    300         1.9412   0.7859
1.0     331              -   0.7862

Best checkpoint at step 150 (NDCG@15 = 0.7887). The final row (step 331) has no training loss entry.
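
The step counts in the log can be cross-checked against the training configuration with a little arithmetic. The sketch below assumes one training example per query–skill pair and that the final incomplete batch is dropped; neither is stated explicitly, but both are consistent with the 331 steps reported.

```python
num_pairs = 127_190        # query-skill pairs in the training split
per_device_batch = 96
num_gpus = 4

effective_batch = per_device_batch * num_gpus

# Assumption: the last partial batch is dropped (floor division).
steps_per_epoch = num_pairs // effective_batch

# Fraction of the epoch completed at each logged step, rounded as in the log.
logged_steps = [50, 100, 150, 200, 250, 300]
epoch_fractions = [round(step / steps_per_epoch, 2) for step in logged_steps]

print(effective_batch, steps_per_epoch, epoch_fractions)
```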

Evaluation Results

Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).

Metric        @5      @10     @15
NDCG          0.7557  0.7803  0.7887
Recall        0.7915  0.8542  0.8809
Completeness  0.6596  0.7509  0.7903
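
For readers unfamiliar with the ranking metrics, the sketch below shows the standard binary-relevance definitions of Recall@k and NDCG@k for a single query. It is illustrative only: the function names are hypothetical, the benchmark's own evaluation code may differ in detail (for example in how scores are averaged across queries), and Completeness is defined in the technical report and is not reproduced here.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant skills that appear in the top k results."""
    hits = sum(1 for skill in ranked[:k] if skill in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, skill in enumerate(ranked[:k], start=1)
        if skill in relevant
    )
    # Ideal ranking places every relevant skill at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Toy example: two relevant skills, retrieved at ranks 1 and 3.
ranked = ["a", "b", "c"]
relevant = {"a", "c"}
print(recall_at_k(ranked, relevant, 2), ndcg_at_k(ranked, relevant, 3))
```

Both functions assume that `relevant` is non-empty; a query with no relevant skills would need to be skipped or handled separately.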

Intended Use

This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.

Limitations

  • Optimized for English-language queries and agent skills.
  • Performance may vary on domains outside the SkillRet benchmark distribution.
  • The model retrieves skills but does not execute them.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.4.1
  • Transformers: 5.5.4
  • PyTorch: 2.7.1+cu128

Citation

If you use this model or the SkillRet benchmark, please cite:

@article{cho2026skillret,
  title   = {SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents},
  author  = {Cho, Hongcheol and Kang, Ryangkyung and Kim, Youngeun},
  journal = {arXiv preprint arXiv:2605.05726},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.05726}
}

Paper: https://arxiv.org/abs/2605.05726
