---
tags:
- vision
library_name: transformers
---
Model Details
This CLIP model was initialized from openai/clip-vit-base-patch32 and trained to learn about what contributes to robustness in computer vision tasks.
The model can generalize to arbitrary image classification tasks in a zero-shot manner.
Top predictions:
Saree: 64.89%
Dupatta: 25.81%
Lehenga: 7.51%
Leggings and Salwar: 0.84%
Women Kurta: 0.44%
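Predictions like these can be reproduced with the zero-shot-image-classification pipeline in Transformers. The sketch below is illustrative only: the candidate label list mirrors the categories above, and "example.jpg" is a placeholder path to a local garment image, not a file shipped with this model.

```python
from transformers import pipeline

# Zero-shot image classification with this checkpoint.
classifier = pipeline("zero-shot-image-classification", model="samim2024/clip")

# Candidate labels are an illustrative assumption matching the categories listed above.
candidate_labels = ["Saree", "Dupatta", "Lehenga", "Leggings and Salwar", "Women Kurta"]

# "example.jpg" is a placeholder path to a local image.
predictions = classifier("example.jpg", candidate_labels=candidate_labels)
for pred in predictions:
    print(f"{pred['label']}: {pred['score']:.2%}")
```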
Use with Transformers
```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("samim2024/clip")
processor = CLIPProcessor.from_pretrained("samim2024/clip")

# Note: this URL points to an istockphoto web page rather than a raw image file;
# replace it with a direct image URL before running.
url = "https://www.istockphoto.com/photo/indian-saris-gm93355119-10451468"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a saree", "a photo of a blouse"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the text labels gives probabilities
```
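To inspect the result, the probabilities can be paired with their captions. This is a minimal follow-up sketch; the label list simply repeats the one passed to the processor above.

```python
# Map each candidate caption to its probability (labels match the list given to the processor).
labels = ["a photo of a saree", "a photo of a blouse"]
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.2%}")
```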
Model Use
Intended Use
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
Primary intended uses
The primary intended users of these models are AI researchers.
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case potentially harmful at this time.
Certain use cases that fall under the domain of surveillance and facial recognition are always out of scope, regardless of the model's performance. This is because the use of artificial intelligence for such tasks is currently premature, given the lack of testing norms and checks to ensure its fair use.
Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.