Some questions about reproducing this MTEB result
Hi, I cannot reproduce this MTEB result. Could you please re-check the model weights?
Besides, what is the pooling method of this model? In your Transformers code example the pooling is last_token; however, in your sentence-transformers code the pooling is mean (is the 1_Pooling module missing? The default pooling in sentence-transformers is mean.)
Hi infgrad,
Thank you for bringing this to our attention. We have carefully reviewed your concerns and retested the model weights. We can confirm that these weights do indeed reproduce the MTEB results as expected.
Regarding the pooling method, we originally used last-token pooling. However, thanks to Tom's assistance, we have corrected the discrepancy in the sentence transformer implementation. The issue has been resolved, and the correct pooling method is now consistently applied.
Could you please retry using the last-token pooling and let us know the results you achieve? We are eager to further discuss and ensure the reproducibility of our model.
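For reference, a minimal sketch of last-token pooling (the function name is illustrative; it handles both left- and right-padded batches):

import torch
from torch import Tensor


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position always holds the last real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, index each sequence at its last non-padded position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths
    ]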
Best,
Ye
@yliu279 I cannot reproduce the results with the sentence-transformers code (using the latest version). When running the BEIR-FiQA eval on the curated, sentence-transformers-based pipeline, I get recall@10 = 17.111, whereas it should be 69.440. It sounds like the pooling might be wrong somewhere; could you confirm that padding and last-token access work the same way across the Transformers and sentence-transformers pipelines?
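For reference, a quick way to check which pooling the sentence-transformers pipeline actually applies (a sketch; the hub id Salesforce/SFR-Embedding-2_R is an assumption, substitute a local path if needed):

from sentence_transformers import SentenceTransformer

# Hub id assumed for illustration; a local checkout path works the same way.
model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")
# A SentenceTransformer is an nn.Sequential of modules; printing it lists the
# Transformer and Pooling modules, including which pooling_mode_* flags are set.
print(model)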
@yliu279
@mbien
Hi, I carefully reviewed my test code and still get the same results. Here is a minimal reproduction:
import functools
import os

from mteb import MTEB
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # load model
    model = SentenceTransformer("/mnt/hwdata/ip/nlp/public_models/SFR-Embedding-2_R", device="cuda")
    model.encode = functools.partial(
        model.encode,
        batch_size=8,
        show_progress_bar=True,
        prompt="Instruct: Retrieve semantically similar text.\nQuery: ",  # only test STS
    )
    evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
    evaluation.run(
        model,
        output_folder="sts_results",
        eval_splits=["test"],
        verbosity=2,
        overwrite_results=True,
    )
The result is:
{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 84.45004653930664,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.701287466112608,
        "cosine_spearman": 0.7236247747370012,
        "euclidean_pearson": 0.7204492422443474,
        "euclidean_spearman": 0.7233661781589509,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.7236247747370012,
        "manhattan_pearson": 0.7397991758538812,
        "manhattan_spearman": 0.7408110026601888,
        "pearson": [
          0.7012874418016998,
          1.2264194202428853e-204
        ],
        "spearman": [
          0.7236247747370012,
          5.439385473263533e-224
        ]
      }
    ]
  },
  "task_name": "STSBenchmark"
}
However, in your README.md, the result is:
- task:
    type: STS
  dataset:
    type: mteb/stsbenchmark-sts
    name: MTEB STSBenchmark
    config: default
    split: test
    revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
  metrics:
  - type: cos_sim_pearson
    value: 83.55433725920493
  - type: cos_sim_spearman
    value: 83.60373857254014
  - type: euclidean_pearson
    value: 83.08086082334839
  - type: euclidean_spearman
    value: 83.6036864776559
  - type: manhattan_pearson
    value: 83.2232267589246
  - type: manhattan_spearman
    value: 83.78923946962664
We noticed a discrepancy in the Sentence Transformers evaluation. We are currently working on resolving this issue and will share the solution shortly. In the meantime, here is the process we use to produce the results; please feel free to try it if you are interested:
Use the E5 evaluation pipeline: https://github.com/microsoft/unilm/blob/master/e5/mteb_except_retrieval_eval.py
First, make two edits in utils.py:
- Add 'SFR-Embedding-2_R': 'instruction' to the MODEL_NAME_TO_PREFIX_TYPE dict and 'SFR-Embedding-2_R': 'last' to the MODEL_NAME_TO_POOL_TYPE dict (sketched below, after the create_batch_dict() code).
- Revise the create_batch_dict() function in utils.py as:
batch_dict = tokenizer(
    input_texts,
    max_length=max_length - 1,  # reserve one position for the EOS token appended below
    return_token_type_ids=False,
    return_attention_mask=False,
    padding=False,
    truncation=True,
)
# append eos_token_id to every input_ids so last-token pooling reads the EOS embedding
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
return tokenizer.pad(
    batch_dict,
    padding=True,
    pad_to_multiple_of=8,
    return_attention_mask=True,
    return_tensors="pt",
)
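For clarity, the two dictionary additions described above look like this (a sketch; the placeholder comments stand in for whatever entries the E5 repo already defines):

# utils.py (sketch): route the model to an instruction-style prefix and last-token pooling
MODEL_NAME_TO_PREFIX_TYPE = {
    # ... existing entries ...
    'SFR-Embedding-2_R': 'instruction',
}
MODEL_NAME_TO_POOL_TYPE = {
    # ... existing entries ...
    'SFR-Embedding-2_R': 'last',
}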
Second, in the main() function of mteb_except_retrieval_eval.py (linked above):
model = DenseEncoder()
evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
evaluation.run(
    model,
    output_folder="sts_results",
    eval_splits=["test"],
    verbosity=2,
    overwrite_results=True,
)
You will get results like those below:
{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 14.305434942245483,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.8355240450842275,
        "cosine_spearman": 0.8360701599480195,
        "euclidean_pearson": 0.8307927408782112,
        "euclidean_spearman": 0.8360703731734451,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.8360701599480195,
        "manhattan_pearson": 0.832215434631109,
        "manhattan_spearman": 0.8378697003913586,
        "pearson": 0.8355240450842275,
        "spearman": 0.8360701599480195
      }
    ]
  },
  "task_name": "STSBenchmark"
}
The Sentence Transformers evaluation is now functioning correctly. We have added "add_eos_token": true to the tokenizer_config.json. You can now obtain accurate results using the ST evaluation.
"scores": {
"test": [
{
"cosine_pearson": 0.8355526890934296,
"cosine_spearman": 0.8360173852997346,
"euclidean_pearson": 0.830706240702224,
"euclidean_spearman": 0.8365412824235895,
"hf_subset": "default",
"languages": [
"eng-Latn"
],
"main_score": 0.8360173852997346,
"manhattan_pearson": 0.8318737804127988,
"manhattan_spearman": 0.8380955443197002,
"pearson": [
0.8355526691849025,
0.0
],
"spearman": [
0.8360186564578723,
0.0
]
}
]
},
"task_name": "STSBenchmark"
Great!
I can now obtain accurate results on all MTEB tasks.