Regarding model fine-tuning

#23
by mylsz - opened

This is amazing work—thank you for your contribution and for making it open source!
Can you provide some suggestions on how to continue fine-tuning the jina-clip-v2 model with local data? After I comment out @torch.inference_mode(), can I directly load and fine-tune the model?

Jina AI org
edited Jan 14

The forward method would probably be more convenient: https://huggingface.co/jinaai/jina-clip-implementation/blob/main/modeling_clip.py#L637. You can of course fine-tune the model; I suggest you take a look at our technical report to understand our training recipe: https://arxiv.org/abs/2412.08802
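
For reference, here is a minimal sketch of what a CLIP-style fine-tuning step through the model's forward method could look like. The forward() argument names and return structure below are assumptions (check the modeling_clip.py link above), and loading the tokenizer and image processor via the Auto classes with trust_remote_code is also assumed:

```python
# Minimal CLIP-style fine-tuning sketch for jina-clip-v2.
# NOTE: the exact forward() signature and return values should be verified
# against modeling_clip.py#L637 linked above; the keyword arguments and the
# (text_emb, image_emb) unpacking used here are assumptions, not the confirmed API.
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

name = "jinaai/jina-clip-v2"
model = AutoModel.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(name, trust_remote_code=True)

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, captions, temperature=0.07):
    """One contrastive step over a batch of (image, caption) pairs."""
    text_inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    image_inputs = image_processor(images, return_tensors="pt")

    # Assumed call pattern: forward() returning text and image embeddings.
    text_emb, image_emb = model(
        input_ids=text_inputs["input_ids"],
        pixel_values=image_inputs["pixel_values"],
    )
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Symmetric InfoNCE loss with in-batch negatives.
    logits = image_emb @ text_emb.T / temperature
    labels = torch.arange(logits.size(0))
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```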

Basically, if you care about text retrieval performance, you need to maintain it by adding text pairs (or even text triplets) alongside the image-caption pairs. Otherwise, simple CLIP-like fine-tuning would be enough.
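
To make the multi-task objective above concrete, here is a model-agnostic sketch that combines an image-caption contrastive loss with a text-pair contrastive loss; the weighting and the embedding inputs are placeholders, not the recipe from the technical report:

```python
# Sketch of the combined objective: CLIP-style image-caption loss plus a
# text-pair loss to preserve text retrieval quality. Pure PyTorch; the
# embeddings are assumed to come from the model's text and vision towers.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.07):
    """Symmetric InfoNCE with in-batch negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def combined_loss(image_emb, caption_emb, query_emb, passage_emb, text_weight=1.0):
    """Weighted sum of the image-caption loss and the text-pair loss."""
    clip_loss = info_nce(image_emb, caption_emb)   # image <-> caption
    text_loss = info_nce(query_emb, passage_emb)   # text <-> text pairs
    return clip_loss + text_weight * text_loss
```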

@gmastrapas Thanks for your advice. Have you made the training code public?

Jina AI org

No, I am afraid the training code is not public.

@gmastrapas Thanks!

mylsz changed discussion status to closed

@gmastrapas @mylsz @AaronJim Hi everyone, I have written a project to fine-tune jina-clip-v2, and it is training now. But I only have one 4090 GPU, so the maximum batch size I can set is 5. I am unsure whether such a small batch size will negatively affect the results of the contrastive loss in CLIP. If it does, are there any good methods to achieve a larger batch size on a single GPU?
This is my training code: https://github.com/tengshaofeng/finetune-jina-clip-v2/blob/main/train_clip.py

Jina AI org

Are you using flash attention? If you don't care that much about sequence length, you can always reduce the max token length to fit a bigger batch. You can also increase the patch dropout rate in the vision tower to discard more visual tokens during training.
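
A rough sketch of those two memory-saving knobs is below. The config field names (vision_config, patch_dropout) are assumptions based on the implementation repo and should be verified against the model's config.json; how flash attention is enabled also depends on the implementation and your flash-attn installation.

```python
# Sketch: fit a larger batch on a single 4090 by (1) raising vision patch
# dropout and (2) capping the text sequence length. Field names are assumed.
from transformers import AutoConfig, AutoModel, AutoTokenizer

name = "jinaai/jina-clip-v2"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)

# 1) Discard more visual tokens during training (saves activation memory).
#    Assumed field name; verify in the vision tower config before relying on it.
config.vision_config.patch_dropout = 0.5

model = AutoModel.from_pretrained(name, config=config, trust_remote_code=True)

# 2) Cap the text sequence length: shorter sequences mean less activation
#    memory, so a bigger batch fits on the GPU.
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
text_inputs = tokenizer(
    ["a short caption", "another caption"],
    padding=True,
    truncation=True,
    max_length=77,  # far below the model's long default context
    return_tensors="pt",
)
```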
