Regarding model fine-tuning

#23
by mylsz - opened

This is amazing work—thank you for your contribution and for making it open source!
Can you provide some suggestions on how to continue fine-tuning the jina-clip-v2 model with local data? After I comment out @torch.inference_mode(), can I directly load and fine-tune the model?

Jina AI org
edited Jan 14

The forward method would probably be more convenient: https://huggingface.co/jinaai/jina-clip-implementation/blob/main/modeling_clip.py#L637. You can of course fine-tune the model; I suggest you take a look at our technical report to understand our training recipe: https://arxiv.org/abs/2412.08802
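
For reference, here is a minimal sketch of what a CLIP-style fine-tuning step through the model's forward method could look like. The forward() argument names and return structure below are assumptions (check the modeling_clip.py link above), and loading the tokenizer and image processor via the Auto classes with trust_remote_code is also assumed:

```python
# Minimal CLIP-style fine-tuning sketch for jina-clip-v2.
# NOTE: the exact forward() signature and return values should be verified
# against modeling_clip.py#L637 linked above; the keyword arguments and the
# (text_emb, image_emb) unpacking used here are assumptions, not the confirmed API.
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

name = "jinaai/jina-clip-v2"
model = AutoModel.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(name, trust_remote_code=True)

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, captions, temperature=0.07):
    """One contrastive step over a batch of (image, caption) pairs."""
    text_inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    image_inputs = image_processor(images, return_tensors="pt")

    # Assumed call pattern: forward() returning text and image embeddings.
    text_emb, image_emb = model(
        input_ids=text_inputs["input_ids"],
        pixel_values=image_inputs["pixel_values"],
    )
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Symmetric InfoNCE loss with in-batch negatives.
    logits = image_emb @ text_emb.T / temperature
    labels = torch.arange(logits.size(0))
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```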

Basically, if you care about text retrieval performance, you need to maintain it by adding text pairs (or even text triplets) alongside the image-caption pairs. Otherwise, simple CLIP-like fine-tuning would be enough.
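
To make the multi-task objective above concrete, here is a model-agnostic sketch that combines an image-caption contrastive loss with a text-pair contrastive loss; the weighting and the embedding inputs are placeholders, not the recipe from the technical report:

```python
# Sketch of the combined objective: CLIP-style image-caption loss plus a
# text-pair loss to preserve text retrieval quality. Pure PyTorch; the
# embeddings are assumed to come from the model's text and vision towers.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.07):
    """Symmetric InfoNCE with in-batch negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def combined_loss(image_emb, caption_emb, query_emb, passage_emb, text_weight=1.0):
    """Weighted sum of the image-caption loss and the text-pair loss."""
    clip_loss = info_nce(image_emb, caption_emb)   # image <-> caption
    text_loss = info_nce(query_emb, passage_emb)   # text <-> text pairs
    return clip_loss + text_weight * text_loss
```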

@gmastrapas Thanks for your advice. Have you made the training code public?

Jina AI org

No, I am afraid the training code is not public.

@gmastrapas Thanks!

mylsz changed discussion status to closed

@gmastrapas @mylsz @AaronJim Hi everyone, I have written a project to fine-tune jina-clip-v2, and it is training now. But I only have one 4090 GPU, so the maximum batch size I can set is 5. I am unsure whether such a small batch size will negatively affect the results of the contrastive loss in CLIP. If it does, are there any good methods to achieve a larger batch size on a single GPU?
This is my training code: https://github.com/tengshaofeng/finetune-jina-clip-v2/blob/main/train_clip.py

Jina AI org

Are you using flash attention? If you don't care that much about sequence length, you can always reduce the max token length to fit a bigger batch. You can also increase the patch dropout rate in the vision tower to discard more visual tokens during training.
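
A rough sketch of those two memory-saving knobs is below. The config field names (vision_config, patch_dropout) are assumptions based on the implementation repo and should be verified against the model's config.json; how flash attention is enabled also depends on the implementation and your flash-attn installation.

```python
# Sketch: fit a larger batch on a single 4090 by (1) raising vision patch
# dropout and (2) capping the text sequence length. Field names are assumed.
from transformers import AutoConfig, AutoModel, AutoTokenizer

name = "jinaai/jina-clip-v2"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)

# 1) Discard more visual tokens during training (saves activation memory).
#    Assumed field name; verify in the vision tower config before relying on it.
config.vision_config.patch_dropout = 0.5

model = AutoModel.from_pretrained(name, config=config, trust_remote_code=True)

# 2) Cap the text sequence length: shorter sequences mean less activation
#    memory, so a bigger batch fits on the GPU.
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
text_inputs = tokenizer(
    ["a short caption", "another caption"],
    padding=True,
    truncation=True,
    max_length=77,  # far below the model's long default context
    return_tensors="pt",
)
```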
