SigLIP just got merged to 🤗transformers and it's super easy to use! To celebrate this, I have created a repository on various SigLIP based projects!
But what is it and how does it work? SigLIP is a vision-text pre-training technique based on contrastive learning.
It jointly trains an image encoder and a text encoder such that the embeddings of matching image-text pairs have the highest dot-product similarity.
The image below is taken from the CLIP paper, where this contrastive pre-training is done with a softmax loss; SigLIP replaces the softmax with a sigmoid. 📎
![image_1](image_1.jpg)
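To make that difference concrete, here is a minimal PyTorch sketch of the pairwise sigmoid loss, following the pseudocode in the paper (`t` is a learnable log-temperature and `b` a learnable bias; variable names are mine):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds, text_embeds, t, b):
    # normalize so the dot product is cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # pairwise logits, scaled by learnable temperature exp(t) and shifted by bias b
    logits = image_embeds @ text_embeds.T * t.exp() + b
    # labels: +1 on the diagonal (matching pairs), -1 everywhere else
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # each image-text pair is an independent binary classification,
    # so no batch-wide softmax normalization is needed
    return -F.logsigmoid(labels * logits).sum() / n
```

Because every pair is scored independently, the loss no longer needs the full similarity matrix normalized across the batch, which is what makes very large batch sizes cheap.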
Highlights✨
🖼️📝 The authors used a medium-sized ViT-B/16 for the image encoder and a B-sized transformer for the text encoder
😍 More performant than CLIP on zero-shot classification
🗣️ The authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batch sizes of up to 1M items, but the authors chose 32k, since performance saturates beyond that (see below)
![image_2](image_2.jpg)
Below you can find prior CLIP variants and SigLIP across different image encoder sizes, and their performance on different datasets 👇🏻
![image_3](image_3.jpg)
The 🤗 Transformers integration comes with a zero-shot-image-classification pipeline, which makes SigLIP super easy to use!
![image_4](image_4.jpg)
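For reference, a minimal sketch of that pipeline with one of the released SigLIP checkpoints (the image path is a placeholder; any local file, PIL image, or URL works):

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip-base-patch16-224",
)
# "cat.jpg" is a placeholder path
outputs = classifier("cat.jpg", candidate_labels=["a cat", "a dog"])
print(outputs)  # list of {"score": ..., "label": ...} dicts
```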
What to use SigLIP for? 🧐
Honestly, the possibilities are endless, but you can use it for image/text retrieval, zero-shot classification, or training multimodal models!
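As a sketch of the retrieval use case, you can embed a gallery of images once and rank them against a text query with `get_image_features` / `get_text_features` (file names here are placeholders):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# placeholder file names for a small gallery to search over
gallery = [Image.open(path) for path in ["a.jpg", "b.jpg", "c.jpg"]]

with torch.no_grad():
    img_inputs = processor(images=gallery, return_tensors="pt")
    img_embeds = F.normalize(model.get_image_features(**img_inputs), dim=-1)

    txt_inputs = processor(
        text=["a painting of a ship at sea"],
        padding="max_length",  # SigLIP was trained with max-length padding
        return_tensors="pt",
    )
    txt_embeds = F.normalize(model.get_text_features(**txt_inputs), dim=-1)

# rank gallery images by cosine similarity to the text query
scores = (txt_embeds @ img_embeds.T).squeeze(0)
best = scores.argsort(descending=True)
print(best, scores[best])
```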
I have made a repository with notebooks and applications that are also hosted on [Spaces](https://t.co/Ah1CrHVuPY).
I have built ["Draw to Search Art"](https://t.co/DcmQWMc1qd), where you can input an image (upload one or draw it) and search among 10k images in WikiArt!
I've also built apps to [compare](https://t.co/m699TMvuW9) CLIP and SigLIP outputs.
![image_5](image_5.jpg)
> [!TIP]
Resources:
[Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343)
by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (2023)
[GitHub](https://github.com/google-research/big_vision)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/siglip)
> [!NOTE]
[Original tweet](https://twitter.com/mervenoyann/status/1745476609686089800) (January 11, 2024)