---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---
|
# Model card for ViT-B-16-SigLIP-i18n-256
|
|
|
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.
|
|
|
This model has been converted from the OpenCLIP checkpoint [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) to a Hugging Face `CLIPVisionModel`.
|
|
|
```Python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
inputs = image_processor(images=image, return_tensors="pt")

vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image embedding from the vision tower
```
|
|
|
There is still a slight difference: Hugging Face's `CLIPVisionModel` uses the [CLS] token embedding as the pooled output, while SigLIP uses a global attention pooling head to produce the final latent feature.
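
As a rough illustration of that difference (a minimal sketch, not part of the original conversion), you can also take the per-patch features from `last_hidden_state` and pool them yourself, continuing from the `outputs` object above. Note that the mean pooling shown here is only an approximation and is not SigLIP's learned attention pooling head:

```Python
# Continuing from `outputs` above (a sketch; mean pooling is NOT SigLIP's attention pooler).
patch_tokens = outputs.last_hidden_state          # (batch, num_tokens, hidden_size)
cls_pooled = outputs.pooler_output                # (batch, hidden_size), [CLS]-based pooling
mean_pooled = patch_tokens[:, 1:, :].mean(dim=1)  # skip token 0 ([CLS]), average the patch tokens
```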
|
|
|
|