license: cc-by-nc-4.0
language:
- en
pipeline_tag: zero-shot-image-classification
widget:
- src: https://huggingface.co/lhaas/StreetCLIP/resolve/main/nagasaki.jpg
candidate_labels: China, South Korea, Japan, Philippines, Taiwan, Vietnam, Cambodia
example_title: Countries
- src: https://huggingface.co/lhaas/StreetCLIP/resolve/main/sanfrancisco.jpeg
candidate_labels: San Jose, San Diego, Los Angeles, Las Vegas, San Francisco, Seattle
example_title: Cities
library_name: transformers
tags:
- geolocalization
- geolocation
- geographic
- street
- climate
- clip
- urban
- rural
- multi-modal
Model Card for StreetCLIP
StreetCLIP is a robust foundation model for open-domain image geolocalization and other geographic and climate-related tasks.
Trained on a dataset of 1.1 million geo-tagged images, it achieves state-of-the-art zero-shot performance on multiple open-domain image geolocalization benchmarks, outperforming supervised models trained on millions of images.
Model Details
Model Description
- Developed by: Authors not disclosed
- Model type: CLIP
- Language: English
- License: Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
- Finetuned from model: openai/clip-vit-large-patch14-336
Model Sources
- Paper: Pre-print available soon ...
- Demo: Currently in development ...
Uses
To be added soon ...
Direct Use
To be added soon ...
Downstream Use
To be added soon ...
Out-of-Scope Use
To be added soon ...
Bias, Risks, and Limitations
To be added soon ...
Recommendations
To be added soon ...
How to Get Started with the Model
Use the code below to get started with the model.
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load StreetCLIP and its matching processor
model = CLIPModel.from_pretrained("geolocational/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocational/StreetCLIP")

# Download an example street-level image
url = "https://huggingface.co/geolocational/StreetCLIP/resolve/main/sanfrancisco.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels for zero-shot classification
choices = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco"]
inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the labels to get probabilities
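To read off a prediction from the snippet above, the highest-probability label can be taken as the model's guess. This is a minimal follow-up sketch reusing the variables defined in the code shown:

# Pick the candidate label with the highest probability
predicted_idx = probs.argmax(dim=1).item()
print(choices[predicted_idx])  # should correspond to "San Francisco" for this example image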
Training Details
Training Data
StreetCLIP was trained on an undisclosed street-level dataset of 1.1 million real-world urban and rural images captured across 101 countries.
Training Procedure
Preprocessing
StreetCLIP uses the same image preprocessing as openai/clip-vit-large-patch14-336.
Evaluation
StreetCLIP was evaluated in a zero-shot setting on two open-domain image geolocalization benchmarks using a technique called hierarchical linear probing. Hierarchical linear probing first attempts to identify the correct country of origin and then, conditioned on that prediction, the correct city. An illustrative sketch of this two-stage inference is given below.
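The sketch below shows one way such a hierarchical, two-stage zero-shot inference could look in code. The candidate countries and cities, the prompt template, and the best_match helper are illustrative assumptions for demonstration, not the exact label spaces or procedure used for the benchmarks.

import torch
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocational/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocational/StreetCLIP")

url = "https://huggingface.co/geolocational/StreetCLIP/resolve/main/sanfrancisco.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

def best_match(image, captions):
    """Return the index of the caption with the highest image-text similarity."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image
    return logits.softmax(dim=1).argmax().item()

# Illustrative candidate hierarchy: countries mapped to a few of their cities
countries = {
    "United States": ["San Francisco", "Seattle", "Las Vegas"],
    "Canada": ["Vancouver", "Toronto", "Montreal"],
}

# Stage 1: predict the country
country_names = list(countries)
country_prompts = [f"A street-level photo taken in {c}" for c in country_names]
country = country_names[best_match(image, country_prompts)]

# Stage 2: predict the city, restricted to cities in the predicted country
cities = countries[country]
city_prompts = [f"A street-level photo taken in {c}, {country}" for c in cities]
city = cities[best_match(image, city_prompts)]

print(country, city)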
Testing Data, Factors & Metrics
Testing Data
Metrics
To be added soon ...
Results
To be added soon ...
Summary
Our experiments demonstrate that our synthetic caption pretraining method significantly improves CLIP's generalized zero-shot capabilities on open-domain image geolocalization while achieving state-of-the-art performance on a selection of benchmark metrics.
Environmental Impact
- Hardware Type: 4 NVIDIA A100 GPUs
- Hours used: 12
Example Image Attribution
To be added soon ...
Citation
Preprint available soon ...
BibTeX:
Available soon ...