---
library_name: transformers
language:
- en
pipeline_tag: image-feature-extraction
license: cc-by-nc-4.0
inference: false
---
# nomic-embed-vision-v1: Expanding the Latent Space
`nomic-embed-vision-v1` is a high-performing vision embedding model that shares the same embedding space as [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).
All Nomic Embed Text models are now **multimodal**!
| Name | ImageNet 0-shot | Datacomp (Avg. 38) | MTEB |
| :-------------------------------:| :-------------- | :----------------- | :------: |
| `nomic-embed-vision-v1.5` | **71.0** | **56.8** | 62.28 |
| `nomic-embed-vision-v1` | 70.7 | 56.7 | **62.39** |
| OpenAI CLIP ViT B/16 | 68.3 | 56.3 | 43.82 |
| Jina CLIP v1 | 59.1 | 52.2 | 60.1 |
## Hosted Inference API
The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
Generating embeddings with the `nomic` Python client is as easy as:
```python
from nomic import embed
import numpy as np

# Embed local image files with the hosted model.
output = embed.image(
    images=[
        "image_path_1.jpeg",
        "image_path_2.png",
    ],
    model='nomic-embed-vision-v1',
)

print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)  # (number of images, embedding dimension)
```
For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-vision).
## Data Visualization
Click the Nomic Atlas map below to explore a 100,000-sample subset of CC3M and compare the vision and text embedding spaces!
[![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/aKJogjDQ4BBiYGRIIrFMa.webp)](https://atlas.nomic.ai/data/nomic-multimodal-series/cc3m-100k-image-bytes-v15/map)
## Training Details
We align our vision embedder to the text embedding space using a technique similar to [LiT](https://arxiv.org/abs/2111.07991), but we lock the text embedder instead of the image encoder!
For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-vision).
Training code is released in the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
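As a rough illustration of the locked-text idea, the sketch below computes a CLIP-style symmetric contrastive loss in plain Python, treating the text embeddings as fixed constants. This is an assumption made for illustration, not the actual training code (see `contrastors` for that): the function names, the temperature of `0.07`, and the toy vectors are all hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    Hypothetical sketch: in a locked-text setup only the image embeddings
    would change during training, while text_embs stay fixed.
    """
    n = len(image_embs)
    # Scaled dot products between every image and every text.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy unit vectors standing in for frozen text embeddings.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], text_embs)
swapped = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], text_embs)
print(aligned < swapped)
```

Image embeddings that match their paired text produce a lower loss than mismatched ones, which is the signal that would pull the vision embedder toward the fixed text space.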
## Usage
Remember that `nomic-embed-text` *requires* task prefixes. When using Nomic Embed in multimodal RAG scenarios (e.g. text-to-image retrieval),
prepend the `search_query: ` prefix to each text query.
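A minimal sketch of applying the prefix consistently is shown here. The helper `add_query_prefix` is hypothetical and not part of any Nomic library; it only prepends the prefix when a query does not already carry it.

```python
# Hypothetical helper for illustration; not part of the nomic or transformers APIs.
QUERY_PREFIX = "search_query: "

def add_query_prefix(queries):
    """Prepend the query prefix to each string that does not already start with it."""
    return [q if q.startswith(QUERY_PREFIX) else QUERY_PREFIX + q for q in queries]

sentences = add_query_prefix([
    "What are cute animals to cuddle with?",
    "search_query: What do cats look like?",  # already prefixed, left unchanged
])
print(sentences)
```

The resulting `sentences` list can be passed to the tokenizer in the same way as the hand-written queries in the retrieval example.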
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests

processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1", trust_remote_code=True)

# Download an example image from COCO val2017.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")

img_emb = vision_model(**inputs).last_hidden_state
# Use the first token's hidden state as the image embedding and L2-normalize it.
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
```
Additionally, you can perform multimodal retrieval!
```python
def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Text queries must carry the `search_query: ` prefix.
sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
text_model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = text_model(**encoded_input)

text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)

# Cosine similarities: rows are images, columns are text queries.
print(torch.matmul(img_embeddings, text_embeddings.T))
```
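For text-to-image retrieval over many images, the similarity scores can be sorted to rank candidates for a query. The sketch below shows this in plain Python with toy unit-length vectors standing in for real embeddings; `rank_images` and the example vectors are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (cosine similarity if both are unit length)."""
    return sum(a * b for a, b in zip(u, v))

def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return (image_index, score) pairs for the top_k most similar images, best first."""
    scores = [(i, dot(query_embedding, emb)) for i, emb in enumerate(image_embeddings)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy normalized embeddings; real ones would come from the models above.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_embedding = [0.0, 1.0]

ranked = rank_images(query_embedding, image_embeddings, top_k=2)
print(ranked)
```

With real embeddings, the same ranking could be done directly on the tensor returned by `torch.matmul`; the plain-Python version is used here only to keep the example dependency-free.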
# Join the Nomic Community
- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)