Model Summary
MEXMA-SigLIP is a model that combines the MEXMA multilingual text encoder and an image encoder from the SigLIP model. This allows us to get a high-performance CLIP model for 80 languages. MEXMA-SigLIP sets state-of-the-art on the Crossmodal-3600 dataset across commercial use-friendly models.
How to use
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
probs = image_logits.softmax(dim=-1)
print(probs)
Acknowledgements
I thank ML Collective and Lambda for providing compute resources to train the model.
- Downloads last month
- 46
Unable to determine this model's library. Check the
docs
.