KaLM-Embedding
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs and trained on higher-quality training data.
KaLM-embedding-multilingual-mini is trained from Qwen/Qwen2-0.5B on large-scale weakly supervised pre-training data and supervised fine-tuning data.
📑 Open-source Plan
- Model Checkpoint
  - KaLM-embedding-multilingual-mini-v1
  - KaLM-embedding-multilingual-mini-instruct-v1
  - KaLM-embedding-multilingual-mini-instruct-v1.5
  - KaLM-embedding-multilingual-max-v1
- Training and Evaluation Code: HITsz-TMG/KaLM-Embedding
- Technical Report: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
- Training Data
Evaluation
Model Name | Model Size | C-MTEB(35) | MTEB(56) | Avg |
---|---|---|---|---|
multilingual-e5-large | 560M | 58.81 | 61.5 | 60.16 |
bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 64.53 |
Requirements
Because the model is built on Qwen2, we recommend installing transformers>=4.37.0; with older versions you may encounter the following error:
KeyError: 'qwen2'
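A quick way to catch this before loading the model is to compare the installed transformers version against the 4.37.0 minimum stated above. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `supports_qwen2` are hypothetical and not part of transformers, and the parser deliberately ignores pre-release suffixes such as `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum transformers version stated in this README.
MIN_TRANSFORMERS = (4, 37, 0)


def parse_version(text):
    """Turn '4.37.0' or '4.38.0.dev0' into a tuple of leading integers.

    Simplification: pre-release suffixes (dev, rc, ...) are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '4.37' compares equal to '4.37.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_qwen2(installed):
    """Return True if the given version string meets the stated minimum."""
    return parse_version(installed) >= MIN_TRANSFORMERS


def describe_installed_transformers():
    """Report whether the locally installed transformers is new enough."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if supports_qwen2(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is too old; upgrade to >=4.37.0"


print(describe_installed_transformers())
```

If the check reports an old version, `pip install -U "transformers>=4.37.0"` should resolve it.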
Usage
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
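Because the call above passes `normalize_embeddings=True`, each embedding has unit length, so the cosine similarity between two sentences reduces to a plain dot product. The sketch below illustrates this with small stand-in vectors rather than real model output; the helper names `l2_normalize` and `dot` are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings=True does)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in 2-D vectors (assumption); real embeddings come from model.encode(...).
toy_embeddings = [l2_normalize(v) for v in ([3.0, 4.0], [4.0, 3.0], [-3.0, -4.0])]

# Pairwise similarity matrix: entry [i][j] compares sentence i with sentence j.
similarity = [[dot(a, b) for b in toy_embeddings] for a in toy_embeddings]

for row in similarity:
    print([round(value, 2) for value in row])
```

With the NumPy array returned by `model.encode`, the same matrix can be obtained as `embeddings @ embeddings.T`.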
We add an instruction for asymmetric tasks: retrieval, reranking, classification, and clustering.
To add an instruction to the queries (the corpus is encoded without one), use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
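For retrieval-style use, the query/corpus asymmetry described above can be made explicit by encoding the two sides separately. The following is a rough sketch, not the authors' reference code: `build_query_prompt` and `encode_asymmetric` are hypothetical helpers, and the prompt layout simply copies the example string above. The `encode` argument is expected to behave like `SentenceTransformer.encode`.

```python
def build_query_prompt(task_description):
    # Same layout as the example prompt above: "Instruct: <task>",
    # then space, newline, space, then "Query: ".
    return f"Instruct: {task_description} \n Query: "


def encode_asymmetric(encode, queries, documents, task_description):
    """Encode queries with an instruction and documents without one.

    `encode` is any callable with the same keyword arguments as
    SentenceTransformer.encode (for example, model.encode).
    """
    prompt = build_query_prompt(task_description)
    # Queries get the instruction prefix.
    query_vecs = encode(queries, prompt=prompt, normalize_embeddings=True)
    # Corpus texts are encoded as-is, with no instruction.
    doc_vecs = encode(documents, normalize_embeddings=True)
    return query_vecs, doc_vecs
```

A call would look like `encode_asymmetric(model.encode, queries, documents, "Classifying the category of french news.")`, where the task description should be replaced with one that matches your task.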
Contact
If you encounter any issues, feel free to contact us by email: yanshek.woo@gmail.com
Evaluation results (self-reported, test set)
Dataset | Metric | Score |
---|---|---|
MTEB AmazonCounterfactualClassification (en-ext) | accuracy | 94.685 |
MTEB AmazonCounterfactualClassification (en-ext) | ap | 61.631 |
MTEB AmazonCounterfactualClassification (en-ext) | ap_weighted | 61.631 |
MTEB AmazonCounterfactualClassification (en-ext) | f1 | 87.071 |
MTEB AmazonCounterfactualClassification (en-ext) | f1_weighted | 94.925 |
MTEB AmazonCounterfactualClassification (en-ext) | main_score | 94.685 |
MTEB AmazonCounterfactualClassification (en) | accuracy | 91.731 |
MTEB AmazonCounterfactualClassification (en) | ap | 68.350 |
MTEB AmazonCounterfactualClassification (en) | ap_weighted | 68.350 |
MTEB AmazonCounterfactualClassification (en) | f1 | 87.905 |