E5 Large V2 with a 1024-token context window

This model uses a positional-embeddings tweak that gives it a context window twice as large as that of its base model (intfloat/e5-large-v2). The extension is training-free: no fine-tuning was applied to any of the base model's parameters, and the two models differ only in their positional embeddings. A paper describing the tweak is coming soon.
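
The exact tweak has not been published yet, so the sketch below is only a rough illustration of one training-free way to extend learned absolute position embeddings: linearly interpolating the base model's 512 position vectors to 1024. The model name is real, but the procedure is an assumption, not necessarily what was done for this checkpoint.

import torch
from transformers import AutoModel

# Illustration only: NOT necessarily the tweak used for this model.
base = AutoModel.from_pretrained("intfloat/e5-large-v2")   # BERT-large encoder
old_pos = base.embeddings.position_embeddings.weight.data  # shape (512, 1024)
new_len = 1024

# Interpolate along the position axis: (1, hidden, 512) -> (1, hidden, 1024)
new_pos = torch.nn.functional.interpolate(
    old_pos.T.unsqueeze(0), size=new_len, mode="linear", align_corners=True
).squeeze(0).T

base.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_pos, freeze=False
)
base.config.max_position_embeddings = new_len
if hasattr(base.embeddings, "position_ids"):       # buffers present in BERT embeddings
    base.embeddings.position_ids = torch.arange(new_len).unsqueeze(0)
if hasattr(base.embeddings, "token_type_ids"):
    base.embeddings.token_type_ids = torch.zeros(1, new_len, dtype=torch.long)
# The tokenizer / SentenceTransformer max_seq_length would also need to be
# raised to 1024 for the longer window to actually be used.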

Evaluation Results

The model was evaluated on four retrieval tasks from the LongEmbed benchmark and achieved the following NDCG@10 scores:

  1. LEMBSummScreenFDRetrieval: 0.8617 (8.80 pp gain over intfloat/e5-large-v2),
  2. LEMBQMSumRetrieval: 0.3112 (6.04 pp gain over intfloat/e5-large-v2),
  3. LEMBWikimQARetrieval: 0.6570 (7.27 pp gain over intfloat/e5-large-v2),
  4. LEMBNarrativeQARetrieval: 0.2792 (1.55 pp gain over intfloat/e5-large-v2)

Steps to reproduce:

from sentence_transformers import SentenceTransformer  # 4.0.2
import mteb  # 1.38.2


# load model
model = SentenceTransformer('idanylenko/e5-large-v2-ctx1024')

# define tasks
retrieval_task_list = [
    "LEMBSummScreenFDRetrieval",
    "LEMBQMSumRetrieval",
    "LEMBWikimQARetrieval",
    "LEMBNarrativeQARetrieval"
]
tasks = mteb.get_tasks(tasks=retrieval_task_list)

# run the evaluation
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
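
Assuming the result-object layout of recent mteb versions (each entry in results is a TaskResult with a per-split scores dict; this is an assumption about the API rather than something documented above), the NDCG@10 values can then be printed like so:

# Print each task's NDCG@10 (assumes a "test" split and the score-dict
# layout used by recent mteb releases).
for res in results:
    split_scores = res.scores["test"][0]
    print(res.task_name, round(split_scores["ndcg_at_10"], 4))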