This model repository contains the checkpoint for the repo https://github.com/TinyVolt/multimodal-patch-embeddings. That repo contains the code for distilling a 21.3M-parameter ViT model, using the CLIP ViT-B-32 model as the teacher. The model was trained on about 3 million images.

What makes this model special is that the embedding of each image patch lies in the same embedding space as the final image embedding. In fact, the final embedding is just a convex combination of the patch embeddings (a weighted sum with non-negative weights that add up to 1). This allows one to compare a text embedding with each of the 64 image patch embeddings individually, not only with the image as a whole.
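
The sketch below illustrates that idea with NumPy on random stand-in data; it does not load this checkpoint or use the repo's code. The embedding size of 512 (the CLIP ViT-B-32 output size), the softmax used to produce the weights, and the 8x8 patch grid are assumptions made for illustration, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# 64 patches as described above; dim=512 is assumed (CLIP ViT-B-32 output size).
num_patches, dim = 64, 512

# Random stand-ins for what the model and the CLIP text encoder would produce.
patch_embeddings = rng.normal(size=(num_patches, dim))
text_embedding = rng.normal(size=dim)

# Convex weights: non-negative and summing to 1.
# A softmax over per-patch scores is one way to get them (an assumption here).
logits = rng.normal(size=num_patches)
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# The final image embedding is the convex combination of the patch embeddings.
final_embedding = weights @ patch_embeddings  # shape: (dim,)


def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    row_norms = np.linalg.norm(matrix, axis=-1)
    return (matrix @ vector) / (row_norms * np.linalg.norm(vector))


# Because patches live in the same space as the final embedding,
# the text embedding can be scored against every patch.
patch_scores = cosine_similarity(patch_embeddings, text_embedding)  # shape: (64,)

# Assumed layout: the 64 patches form an 8x8 grid over the image.
heatmap = patch_scores.reshape(8, 8)
best_row, best_col = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

With real embeddings, `heatmap` would indicate which regions of the image best match the text. Note also that, since the combination is linear, the dot product of the text embedding with `final_embedding` equals the same weighted sum of its dot products with the individual patch embeddings.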
