---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-4.0
tags:
- clip
- multilingual
---
# Model Card for Distilled MetaCLIP 2 ViT-B/32 (mT5 Tokenizer) (worldwide)

Distilled MetaCLIP 2 (worldwide) was presented in [MetaCLIP 2: A Worldwide Scaling Recipe](https://huggingface.co/papers/2507.22062).

This checkpoint corresponds to "ViT-B-32-mT5-worldwide" of the [original implementation](https://github.com/facebookresearch/MetaCLIP).
## Install

First, install the Transformers library (from source for now):

```bash
pip install -q git+https://github.com/huggingface/transformers.git
```
## Usage

Next, you can use the model like so:

```python
import torch
from transformers import pipeline

# load the model as a zero-shot image classification pipeline on GPU 0
clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    device=0,
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# classify an image from the COCO validation set against the candidate labels
results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
```
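
The pipeline returns a list of dictionaries, one per candidate label, each containing a `label` and a `score`, sorted from most to least likely.
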
In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# note: for this checkpoint, `AutoModel` resolves to an instance of `MetaClip2Model`
model = AutoModel.from_pretrained(
    "facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
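
If you instead need standalone embeddings (e.g. for image-text retrieval), the model should also expose the CLIP-style `get_image_features` and `get_text_features` methods. Below is a minimal sketch under that assumption, reusing `model`, `processor`, `image`, and `labels` from the snippet above:

```python
import torch
import torch.nn.functional as F

# embed the image and the candidate captions separately
with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**image_inputs)  # assumed CLIP-style API

    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    text_features = model.get_text_features(**text_inputs)  # assumed CLIP-style API

# L2-normalize so cosine similarity reduces to a dot product
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# cosine similarity between the image and each label, shape (1, num_labels)
similarity = image_features @ text_features.T
print(similarity)
```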