Microsoft released LLM2CLIP: a CLIP model with a longer context window for complex text inputs 🤯
All models are Apache 2.0 licensed: microsoft/llm2clip-672323a266173cfa40b32d4c
TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, achieving better top-k accuracy on retrieval.
This will enable better image-text retrieval, better zero-shot image classification, and better vision language models 🔥
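To make the retrieval claim concrete, here is a minimal sketch of how top-k image-text retrieval is typically scored with CLIP-style embeddings: both sides are L2-normalized so the dot product is cosine similarity, and the k highest-scoring images are returned. The embeddings below are random toy data, not outputs of LLM2CLIP.

```python
import numpy as np

def topk_retrieval(text_emb, image_embs, k=3):
    """Return indices of the k images most similar to a text embedding.

    Inputs are L2-normalized so the dot product equals cosine similarity,
    the standard scoring rule for CLIP-style image-text retrieval.
    """
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb       # one cosine similarity per image
    return np.argsort(-sims)[:k]       # indices of the k best matches

# Toy example: 4 hypothetical image embeddings; the text embedding is a
# slightly perturbed copy of image 2, so image 2 should rank first.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 8))
text_emb = image_embs[2] + 0.01 * rng.normal(size=8)
print(topk_retrieval(text_emb, image_embs, k=2))
```

A stronger text encoder (here, an LLM fine-tuned on captioning) produces embeddings that place the correct image higher in this ranking, which is exactly what the reported top-k accuracy measures.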
Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2411.04997)