vinsis/multimodal-patch-embeddings


	---
	license: mit
	---

	This repository hosts the model checkpoint for the code at https://github.com/TinyVolt/multimodal-patch-embeddings. That repo contains the code for distilling a ViT model with 21.3M parameters, using the CLIP ViT-B-32 model as the teacher. The model was trained on about 3 million images.

	What sets this model apart is that the embedding of each image patch lies in the same embedding space as the final image embedding. In fact, the final embedding is a convex combination of the patch embeddings, that is, a weighted sum whose weights are non-negative and sum to 1. As a result, a text embedding can be compared not only with the final image embedding but also with each of the 64 image patch embeddings.
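
The relationship described above can be illustrated with a small, self-contained sketch. It does not load the checkpoint or use the repo's code: the embeddings are random placeholders, the embedding size of 512 is assumed from CLIP ViT-B-32, the softmax used to produce the weights is only one possible way to get a convex combination, and the 8x8 patch layout is a guess based on the count of 64. All function and variable names are hypothetical.

```python
import math
import random

NUM_PATCHES = 64  # number of patch embeddings stated in the model card
EMBED_DIM = 512   # assumption: embedding size of CLIP ViT-B-32


def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combination(weights, vectors):
    """Weighted sum of vectors; the weights must be >= 0 and sum to 1."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder data standing in for real model outputs.
rng = random.Random(0)
patch_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_PATCHES)
]
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

# Assumption: the weights come from a softmax. The model card only says
# that the final embedding is a convex sum of the patch embeddings.
weights = softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_PATCHES)])
image_embedding = convex_combination(weights, patch_embeddings)

# Because patches and the final embedding share one space, the same text
# embedding can be scored against the whole image and against each patch.
image_score = cosine_similarity(text_embedding, image_embedding)
patch_scores = [cosine_similarity(text_embedding, p) for p in patch_embeddings]
best_patch = max(range(NUM_PATCHES), key=lambda i: patch_scores[i])

# Assumption: the 64 patches form an 8x8 grid in row-major order.
score_grid = [patch_scores[row * 8:(row + 1) * 8] for row in range(8)]
```

With real embeddings in place of the random ones, `score_grid` could be rendered as a coarse heatmap showing which image regions match the text most closely.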