---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-4.0
tags:
- clip
- multilingual
---
# Model Card for Distilled MetaCLIP 2 ViT-B/32 (mT5 Tokenizer) (worldwide)

Distilled MetaCLIP 2 (worldwide) was presented in [MetaCLIP 2: A Worldwide Scaling Recipe](https://huggingface.co/papers/2507.22062).

This checkpoint corresponds to "ViT-B-32-mT5-worldwide" of the [original implementation](https://github.com/facebookresearch/MetaCLIP).
## Install

First, install the Transformers library (from source for now):

```bash
pip install -q git+https://github.com/huggingface/transformers.git
```
## Usage

Next, you can use the model like so:

```python
import torch
from transformers import pipeline

# load the model as a zero-shot image classification pipeline on GPU 0
clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    device=0,
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# classify an image from the COCO validation set against the candidate labels
results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
```
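
The pipeline returns a list of dictionaries, one per candidate label, each containing a `label` and a `score`, sorted from most to least likely.
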
In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# note: for this checkpoint, `AutoModel` resolves to an instance of `MetaClip2Model`
model = AutoModel.from_pretrained(
    "facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
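
If you instead need standalone embeddings (e.g. for image-text retrieval), the model should also expose the CLIP-style `get_image_features` and `get_text_features` methods. Below is a minimal sketch under that assumption, reusing `model`, `processor`, `image`, and `labels` from the snippet above:

```python
import torch
import torch.nn.functional as F

# embed the image and the candidate captions separately
with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**image_inputs)  # assumed CLIP-style API

    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    text_features = model.get_text_features(**text_inputs)  # assumed CLIP-style API

# L2-normalize so cosine similarity reduces to a dot product
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# cosine similarity between the image and each label, shape (1, num_labels)
similarity = image_features @ text_features.T
print(similarity)
```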