---
license: mit
language:
- ar
- kn
- ar
- ka
- af
- kk
- am
- km
- ar
- ky
- ar
- ko
- as
- lo
- az
- ml
- az
- mr
- be
- mk
- bn
- my
- bs
- nl
- bg
- ca
- 'no'
- cs
- ne
- ku
- pl
- cy
- pt
- da
- ro
- de
- ru
- el
- sa
- en
- si
- eo
- sk
- et
- sl
- eu
- sd
- fi
- so
- fr
- es
- gd
- sr
- ga
- su
- gl
- sv
- gu
- sw
- ha
- ta
- he
- te
- hi
- th
- hr
- tr
- hu
- ug
- hy
- uk
- id
- ur
- is
- vi
- it
- xh
- jv
- zh
- ja
pipeline_tag: zero-shot-image-classification
tags:
- siglip
- clip
- mexma
---
## Model Summary
MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
[SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model, yielding a high-performance CLIP-style model that covers 80 languages.
MEXMA-SigLIP sets the state of the art on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) benchmark among models with commercial-use-friendly licenses.
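Under the hood, scoring works the same way as in any CLIP/SigLIP-style model: text embeddings from MEXMA and image embeddings from the SigLIP vision tower are compared in a shared space with a scaled cosine similarity. The sketch below only illustrates that scoring step; the function name `clip_style_logits`, the 768-dimensional space, and the scale value are illustrative assumptions, not the model's actual internals.
```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_emb: torch.Tensor, text_emb: torch.Tensor,
                      logit_scale: float, logit_bias: float = 0.0) -> torch.Tensor:
    # L2-normalize both modalities so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Scaled similarities: one row per image, one column per candidate caption.
    return logit_scale * image_emb @ text_emb.T + logit_bias

# Toy shapes: 1 image, 3 candidate captions, 768-dimensional shared space.
image_emb = torch.randn(1, 768)
text_emb = torch.randn(3, 768)
probs = clip_style_logits(image_emb, text_emb, logit_scale=100.0).softmax(dim=-1)
print(probs)  # probability of each caption matching the image
```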
## How to use
```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model (trust_remote_code is required because the model ships custom modeling code).
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download an example image and preprocess it into pixel values.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Multilingual candidate captions: Russian "cat", English "a dog", Hindi "Eiffel Tower".
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
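The printed `probs` tensor has one row per image and one column per candidate caption; for the Eiffel Tower photo above, the highest probability should fall on the Hindi caption.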
## Acknowledgements
I thank [ML Collective](https://mlcollective.org/) and [Lambda](https://lambdalabs.com/) for providing compute resources to train the model. |