---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---
|
# Model card for ViT-B-16-SigLIP-i18n-256
|
|
|
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.
|
|
|
This model has been converted from the OpenCLIP checkpoint [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) to a Hugging Face `CLIPVisionModel`.
|
|
|
```Python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
inputs = image_processor(images=image, return_tensors="pt")

vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image embedding from the vision tower
```
|
|
|
There is still a slight difference: Hugging Face's `CLIPVisionModel` uses the [CLS] token embedding as the pooled output, while SigLIP uses a global attention pooling head to produce the final latent feature.
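
As a rough illustration of that difference (a minimal sketch, not part of the original conversion), you can also take the per-patch features from `last_hidden_state` and pool them yourself, continuing from the `outputs` object above. Note that the mean pooling shown here is only an approximation and is not SigLIP's learned attention pooling head:

```Python
# Continuing from `outputs` above (a sketch; mean pooling is NOT SigLIP's attention pooler).
patch_tokens = outputs.last_hidden_state          # (batch, num_tokens, hidden_size)
cls_pooled = outputs.pooler_output                # (batch, hidden_size), [CLS]-based pooling
mean_pooled = patch_tokens[:, 1:, :].mean(dim=1)  # skip token 0 ([CLS]), average the patch tokens
```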
|
|
|
|