jinaai/jina-embeddings-v3 · Finetuning The Model with Custom Dataset

Nov 28, 2024 • edited Nov 28, 2024

I am trying to fine-tune this model with SentenceTransformerTrainer in order to update the weights of the retrieval.passage adapter. Is there a tutorial notebook or a similar example for this? I am getting several different errors from different parts of the code, for example:

RuntimeError: FlashAttention is not installed. To proceed with training, please install FlashAttention. For inference, you have two options: either install FlashAttention or disable it by setting use_flash_attn=False when loading the model.

(This happens even though I disabled the flash attention option with use_flash_attn=False; I am not sure whether I am setting it correctly.)

RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.

NameError: name 'IterableDataset' is not defined

Here is my implementation. I am running it in a Colab session with an A100 GPU.

!pip install --upgrade torch transformers sentence-transformers
!pip install flash-attn --no-build-isolation
!pip install datasets
!pip install einops
!pip install 'numpy<2'

from sentence_transformers import SentenceTransformer
import torch

# I am not sure about the kwargs so I put them in both model and config kwargs
model = SentenceTransformer("jinaai/jina-embeddings-v3", 
                            trust_remote_code=True,
                            model_kwargs={'use_flash_attn': False,
                                          "use_cache": False,
                                           'lora_main_params_trainable': False, 
                                           "default_task": "retrieval.passage", 
                                           "torch_dtype": torch.float16},
                            config_kwargs={'use_flash_attn': False,
                                           "use_cache": False,
                                           'lora_main_params_trainable': False, 
                                           "default_task": "retrieval.passage", 
                                           "torch_dtype": torch.float16}) 

dataset = torch.load('/content/drive/MyDrive/matrag/finetuning_dataset/msmarco_tr_sample.pt')
train_dataset = dataset['train']
eval_dataset = dataset['eval']

from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments, BatchSamplers
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
from sentence_transformers import SentenceTransformerTrainer

loss = CoSENTLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir='jina-embeddings-v3',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.1,
    bf16=False,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy='steps',
    eval_steps=1000,
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=2,
    logging_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model='cosine_accuracy',
)

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["doc"],
    sentences2=eval_dataset["candidate"],
    scores=eval_dataset["label"],
    main_similarity=SimilarityFunction.COSINE,
    name="example-dev",
)

print(evaluator(model))

trainer = SentenceTransformerTrainer(
    model=model,
    loss=loss,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    evaluator=evaluator
)

trainer.train()

eneSadi

Nov 28, 2024

You can use this as an example dataset if you want to reproduce the issue:

from datasets import Dataset

train_dataset = Dataset.from_dict({
    "doc": ["doc1", "doc2", "doc3"],
    "candidate": ["candidate1", "candidate2", "candidate3"],
    "label": [1, 0, 1],
})

eval_dataset = Dataset.from_dict({
    "doc": ["doc1", "doc2", "doc3"],
    "candidate": ["candidate1", "candidate2", "candidate3"],
    "label": [1, 0, 1],
})

jupyterjazz

Jina AI org Dec 3, 2024

Hi @eneSadi, it looks like flash-attention is not installed. You need flash-attention to train jina-embeddings-v3.
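
One way to confirm this before loading the model is to check whether the relevant packages can be found in the current runtime. The snippet below is only a minimal sketch: the helper name `missing_packages` and the list `REQUIRED_FOR_TRAINING` are hypothetical, and the module names are assumed from the pip commands in the original post (the `flash-attn` package is imported as `flash_attn`).

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be located."""
    # find_spec only looks the module up; it does not import it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, taken from the pip installs in the original post.
REQUIRED_FOR_TRAINING = ["torch", "flash_attn", "datasets", "einops"]

missing = missing_packages(REQUIRED_FOR_TRAINING)
if missing:
    print("Not found in this runtime:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

If `flash_attn` shows up as missing, that would be consistent with the FlashAttention RuntimeError above. Note that this check only shows that a package can be located; it does not prove that flash-attn was built against the installed torch and CUDA versions.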