---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---
# Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from the OpenCLIP checkpoint [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) to a Hugging Face `CLIPVisionModel`.

```Python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the converted vision tower
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')

inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image features, shape (batch_size, hidden_size)
```

There is still a slight difference: Hugging Face's `CLIPVisionModel` uses the [CLS] token embedding as the pooled output, while SigLIP uses a global attention pooling head to obtain the final latent feature.
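
If you want a pooled feature closer in spirit to SigLIP's attention pooling, one rough workaround (an approximation only; the original attention pooling head is not part of the converted model) is to average the patch tokens from `last_hidden_state` instead of using `pooler_output`. Continuing from the snippet above:

```Python
# Rough, hypothetical approximation of SigLIP-style global pooling:
# average the patch tokens instead of relying on the [CLS]-based pooler_output.
patch_tokens = outputs.last_hidden_state[:, 1:, :]  # drop the [CLS] token
pooled_feature = patch_tokens.mean(dim=1)           # shape: (batch_size, hidden_size)
```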