bwang0911 
posted an update Jun 3
We are very proud to introduce jinaai/jina-clip-v1, aka "jina-embeddings-multimodal".

The OpenAI CLIP model openai/clip-vit-base-patch32 aligns the text and image modalities well, so users can build cross-modal text-image retrieval or image classification on top of it. However, due to its training data and recipe, it cannot:

1. model longer sequences of text input (the 77-token constraint).
2. produce strong text representations (the CLIP text tower is weak for text search).

In our latest publication, Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2405.20204), we proposed a multi-task, multi-objective learning scheme. The resulting CLIP model shows:

1. Stronger cross-modal performance than the OpenAI model: 2% and 6% improvements in cross-modal retrieval recall@5.
2. The text tower of JinaCLIP is a strong text encoder, reaching the same performance as jinaai/jina-embeddings-v2-base-en: a 165% improvement in MTEB[BEIR] recall@5.
3. The image tower of JinaCLIP also performs strongly in image-to-image search (CBIR), with a 12% recall improvement on the CIFAR-100 test set.
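To make the multi-task idea concrete, here is a minimal sketch of a combined contrastive objective: one InfoNCE-style loss aligning text with images, plus one aligning text with paired text. This is only an illustration with random placeholder embeddings; the actual loss weighting, temperature, and batching in the paper may differ.

```python
import numpy as np

def info_nce(q, k, temperature=0.07):
    # q, k: L2-normalized embedding matrices; matching rows are positive pairs
    logits = (q @ k.T) / temperature
    # row-wise log-softmax; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)

# placeholder embeddings standing in for real model outputs
text = norm(rng.normal(size=(4, 8)))
image = norm(rng.normal(size=(4, 8)))
text_pair = norm(rng.normal(size=(4, 8)))  # e.g. query/passage text pairs

# multi-task objective: align text with images and text with text
loss = info_nce(text, image) + info_nce(text, text_pair)
```

Training the text tower on both objectives at once is what lets a single model serve as both a CLIP-style cross-modal encoder and a standalone text retriever.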

If you are working on MuRAG (multimodal retrieval-augmented generation), try it out!
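As a sketch of how the embeddings would be used for cross-modal retrieval once you have them: score each text query against every image embedding with cosine similarity and take the best match. The vectors below are toy placeholders, not real model outputs.

```python
import numpy as np

def cosine_sim(a, b):
    # L2-normalize rows, then the dot product gives a cosine-similarity matrix
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# toy stand-ins for text and image embeddings from the model
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([[0.9, 0.1], [0.1, 0.9]])

scores = cosine_sim(text_emb, image_emb)
best = scores.argmax(axis=1)  # index of the best-matching image per text query
```

Because the text tower is also a strong text encoder, the same scoring works unchanged for text-to-text retrieval with a single model.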
