Commit 3f4007a · 1 Parent(s): 727b2f2

feat: update checkpoint to latest

Files changed:
- README.md +7 -2
- config.json +1 -1
- model.safetensors +2 -2
- preprocessor_config.json +1 -1
- pytorch_model.bin +2 -2
README.md
CHANGED
@@ -133,6 +133,7 @@ inference: false
 <b>Jina CLIP: your CLIP model is also your text retriever!</b>
 </p>

+
 ## Quick Start

 [Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)
@@ -144,8 +145,8 @@ inference: false

 `jina-clip-v2` is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
 * *support for multiple languages* - the text tower now supports 100 languages with tuning focus on **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
-* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and
-* *visual document retrieval performance
+* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables slicing the output vectors and consequently reducing computation and storage costs.
+* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. Because of this, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.

 Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
 This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
@@ -155,6 +156,7 @@ This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

 [Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon!

+
 ## Usage

 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
@@ -252,6 +254,7 @@ console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity
 console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity
 ```

+
 ## Performance

 ### Text-Image Retrieval
@@ -262,10 +265,12 @@ Coming soon!

 Coming soon!

+
 ## Contact

 Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

+
 ## Citation

 If you find `jina-clip-v2` useful in your research, please cite the following paper:
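The Matryoshka bullet added in this revision means the 1024-dimensional output vectors can simply be sliced shorter and re-normalized at inference time. Below is a minimal sketch of that workflow; it assumes the repository's custom modeling code exposes `encode_text`/`encode_image` helpers that accept raw strings and image paths and return NumPy-compatible arrays (as `jina-clip-v1` does), and `cat.jpg` is a placeholder file, not an asset from this repo:

```python
# Sketch only: Matryoshka-style truncation of jina-clip-v2 embeddings.
# encode_text/encode_image and their return types are assumptions based on jina-clip-v1.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

text_emb = np.asarray(model.encode_text(["A photo of a cat sleeping on a sofa"]))[0]
image_emb = np.asarray(model.encode_image(["cat.jpg"]))[0]  # placeholder image path

def truncate(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize (Matryoshka slicing)."""
    v = vec[:dim].astype(np.float32)
    return v / np.linalg.norm(v)

for dim in (1024, 512, 128, 64):
    t, i = truncate(text_emb, dim), truncate(image_emb, dim)
    print(dim, float(t @ i))  # cosine similarity should degrade only gradually
```

Slicing plus re-normalization is all that Matryoshka truncation requires at query time; smaller slices trade a little retrieval quality for proportionally lower storage and compute.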
config.json
CHANGED
@@ -43,7 +43,7 @@
   "embed_dim": 1024,
   "fused_layer_norm": false,
   "head_width": 64,
-  "image_size":
+  "image_size": 512,
   "intp_freq": true,
   "layers": 24,
   "ls_init_value": null,
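The single functional change in `config.json` is the `"image_size"` entry, bumped to 512 to match the higher-resolution checkpoint. A quick way to confirm which resolution a locally downloaded snapshot was built for is to read the value straight out of the file; the recursive lookup below is a sketch that avoids assuming where the key is nested (the file can be fetched with `huggingface_hub.hf_hub_download` if it is not already on disk):

```python
# Sketch: locate "image_size" anywhere inside config.json without assuming its nesting.
import json

def find_key(obj, key):
    """Yield every value stored under `key` in a nested dict/list structure."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)

with open("config.json") as f:  # path inside a local snapshot of jinaai/jina-clip-v2
    config = json.load(f)

print(list(find_key(config, "image_size")))  # expected to contain 512 after this commit
```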
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13
+size 1730688642
preprocessor_config.json
CHANGED
@@ -13,7 +13,7 @@
   ],
   "processor_class": "JinaCLIPProcessor",
   "resize_mode": "shortest",
-  "size":
+  "size": 512,
   "std": [
     0.26862954,
     0.26130258,
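`preprocessor_config.json` raises `"size"` to 512 in step with the model config. Read together with `"resize_mode": "shortest"`, the apparent intent is that the shorter side of an input image is scaled to 512 before normalization. A small PIL sketch of that resize rule, purely illustrative rather than the repository's actual preprocessing code:

```python
# Sketch: what "resize_mode": "shortest" with "size": 512 implies for an input image.
from PIL import Image

def resize_shortest_side(img: Image.Image, target: int = 512) -> Image.Image:
    """Scale the image so its shorter side equals `target`, preserving aspect ratio."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

img = Image.new("RGB", (1024, 768))    # placeholder image
print(resize_shortest_side(img).size)  # (683, 512): the shorter side is now 512
```

In practice the declared `JinaCLIPProcessor` applies these settings; the sketch only makes the meaning of the two keys concrete.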
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9
+size 1730896230
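Both weight files (`model.safetensors` and `pytorch_model.bin`) are tracked with Git LFS, so the commit only rewrites their pointer files: a `sha256` object id and a byte `size`. Here is a hedged sketch for checking that locally downloaded weights match the pointers recorded in this commit; the paths are placeholders for wherever the snapshot lives on disk:

```python
# Sketch: verify downloaded LFS-tracked weight files against the pointers in this commit.
import hashlib
from pathlib import Path

EXPECTED = {
    # oid / size values copied from the updated pointer files above
    "model.safetensors": ("771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13", 1730688642),
    "pytorch_model.bin": ("f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9", 1730896230),
}

def verify(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Stream the file through sha256 and compare both the digest and the byte size."""
    digest, size = hashlib.sha256(), 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest() == expected_sha256 and size == expected_size

for name, (oid, size) in EXPECTED.items():
    path = Path(name)  # placeholder: adjust to the local snapshot directory
    if path.exists():
        print(name, "OK" if verify(path, oid, size) else "MISMATCH")
```

The two byte sizes differ slightly, presumably because the same checkpoint is serialized in two formats.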