Multilingual-clip: XLM-Roberta-Large-Vit-B-32

Multilingual-CLIP extends OpenAI's English text encoders to multiple other languages. This model contains only the multilingual text encoder. The corresponding image model, ViT-B/32, can be obtained by following the instructions in OpenAI's CLIP repository on GitHub. A usage example is provided below.

Requirements

To use both the multilingual text encoder and corresponding image encoder, we need to install the packages multilingual-clip and clip.

pip install multilingual-clip
pip install git+https://github.com/openai/CLIP.git

Usage

Extracting embeddings from the text encoder can be done in the following way:

from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    'Three blind horses listening to Mozart.',
    'Älgen är skogens konung!',
    'Wie leben Eisbären in der Antarktis?',
    'Вы знали, что все белые медведи левши?'
]
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-32'

# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

embeddings = model.forward(texts, tokenizer)
print("Text features shape:", embeddings.shape)

Extracting embeddings from the corresponding image encoder:

import torch
import clip
import requests
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use a distinct name so the text encoder loaded above is not overwritten
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Download an example image and apply CLIP's preprocessing
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = clip_model.encode_image(image)

print("Image features shape:", image_features.shape)
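
The two snippets above compute text and image embeddings separately. One common way to compare them is cosine similarity, sketched below under the assumption that both encoders produce vectors of the same dimensionality. The helper name `cosine_similarity_matrix` is illustrative and not part of either package. The casts to float32 are included because CLIP may return half-precision features when running on a GPU. The random tensors and their shapes are stand-ins only.

```python
import torch


def cosine_similarity_matrix(text_features: torch.Tensor,
                             image_features: torch.Tensor) -> torch.Tensor:
    """Return a (num_texts, num_images) matrix of cosine similarities."""
    # Bring both inputs to float32 on the CPU so dtypes and devices match.
    text = text_features.detach().float().cpu()
    image = image_features.detach().float().cpu()
    # L2-normalise each row so that the dot product equals cosine similarity.
    text = text / text.norm(dim=-1, keepdim=True)
    image = image / image.norm(dim=-1, keepdim=True)
    return text @ image.T


# Stand-in tensors with hypothetical shapes; replace them with the
# `embeddings` and `image_features` tensors computed above.
text_demo = torch.randn(4, 512)
image_demo = torch.randn(1, 512)
similarity = cosine_similarity_matrix(text_demo, image_demo)
print("Similarity shape:", tuple(similarity.shape))
```

Each row of the result corresponds to one text and each column to one image, so the highest value in a column points to the text that best matches that image.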

Evaluation results

None of the M-CLIP models have been extensively evaluated, but on text-to-image (Txt2Img) retrieval with the human-translated MS-COCO dataset, we see the following R@10 results:

Name                   En    De    Es    Fr    Zh    It    Pl    Ko    Ru    Tr    Jp
OpenAI CLIP Vit-B/32   90.3  -     -     -     -     -     -     -     -     -     -
OpenAI CLIP Vit-L/14   91.8  -     -     -     -     -     -     -     -     -     -
OpenCLIP ViT-B-16+-    94.3  -     -     -     -     -     -     -     -     -     -
LABSE Vit-L/14         91.6  89.6  89.5  89.9  88.9  90.1  89.8  80.8  85.5  89.8  73.9
XLM-R Large Vit-B/32   91.8  88.7  89.1  89.4  89.3  89.8  91.4  82.1  86.1  88.8  81.0
XLM-R Vit-L/14         92.4  90.6  91.0  90.0  89.7  91.1  91.3  85.2  85.8  90.3  81.9
XLM-R Large Vit-B/16+  95.0  93.0  93.6  93.1  94.0  93.1  94.4  89.0  90.0  93.0  84.2
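
For readers unfamiliar with the metric, R@10 (recall at 10) is the percentage of text queries for which the correct image appears among the ten highest-scoring candidates. The exact evaluation protocol is not described here, so the following is only a minimal sketch that assumes one ground-truth image per caption. The function name `recall_at_k` and the toy scores are illustrative.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[int],
                k: int = 10) -> float:
    """Percentage of queries whose correct item is among the top-k scores.

    similarity[i][j] is the score between text query i and image j;
    ground_truth[i] is the index of the image that matches query i.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank image indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(ground_truth)


# Toy example: 3 captions, 4 images, checked at k=2.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.2, 0.4, 0.8, 0.7],  # correct image 1 is ranked third
    [0.1, 0.2, 0.6, 0.5],  # correct image 2 is ranked first
]
print(recall_at_k(toy_similarity, ground_truth=[0, 1, 2], k=2))
```

In the toy example, two of the three captions retrieve their image within the top two, giving a recall of about 66.7.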

Training/Model details

Further details about the model training and data can be found in the model card.
