michaelfeil
committed on
Commit 18ed68f
Parent(s): 42dc317
Upload jinaai/jina-embedding-t-en-v1 ctranslate2 weights
- README.md +215 -0
- config.json +23 -0
- model.bin +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +15 -0
- vocab.txt +0 -0
- vocabulary.json +0 -0
- vocabulary.txt +0 -0
README.md
ADDED
@@ -0,0 +1,215 @@
---
pipeline_tag: sentence-similarity
tags:
- ctranslate2
- int8
- float16
- finetuner
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- jinaai/negation-dataset
language: en
license: apache-2.0
---
# Fast-Inference with Ctranslate2
Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

quantized version of [jinaai/jina-embedding-t-en-v1](https://huggingface.co/jinaai/jina-embedding-t-en-v1)
```bash
pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
```

```python
# from transformers import AutoTokenizer
model_name = "michaelfeil/ct2fast-jina-embedding-t-en-v1"
model_name_orig = "jinaai/jina-embedding-t-en-v1"

from hf_hub_ctranslate2 import EncoderCT2fromHfHub
model = EncoderCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16"
)
outputs = model.generate(
    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    max_length=64,
)  # perform downstream tasks on outputs
outputs["pooler_output"]
outputs["last_hidden_state"]
outputs["attention_mask"]

# alternative: use the SentenceTransformer mix-in
# for end-to-end sentence-embedding generation
# (not pulling from this CT2fast-HF repo)

from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name_orig, compute_type="int8_float16", device="cuda"
)
embeddings = model.encode(
    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
print(embeddings.shape, embeddings)
scores = (embeddings @ embeddings.T) * 100

# Hint: you can also host this code as a REST API
# via github.com/michaelfeil/infinity
```
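
If you work with the raw `outputs` dict from `EncoderCT2fromHfHub` rather than the SentenceTransformer mix-in, the token states still need to be pooled into one vector per sentence. Below is a minimal sketch (not part of the original card) assuming the returned `last_hidden_state` and `attention_mask` are NumPy arrays of shape `(batch, seq_len, hidden)` and `(batch, seq_len)`; convert them first if your build returns torch tensors.

```python
import numpy as np

# mean-pool token states with the attention mask -> one embedding per sentence
hidden = np.asarray(outputs["last_hidden_state"])        # (batch, seq_len, hidden)
mask = np.asarray(outputs["attention_mask"])[..., None]  # (batch, seq_len, 1)

summed = (hidden * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
sentence_embeddings = summed / counts                    # (batch, hidden)

# normalize and score, analogous to the CT2SentenceTransformer path above
sentence_embeddings /= np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
scores = sentence_embeddings @ sentence_embeddings.T
```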

Checkpoint compatible with [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2)
and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"` (see the sketch below)

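For a CPU-only deployment, the same loader from the example above can be pointed at `int8`; this is a sketch that simply swaps the two arguments listed here, not an additional API:

```python
from hf_hub_ctranslate2 import EncoderCT2fromHfHub

# int8 weights on CPU; call pattern identical to the CUDA example above
model_cpu = EncoderCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-jina-embedding-t-en-v1",
    device="cpu",
    compute_type="int8",
)
outputs = model_cpu.generate(text=["I like soccer"], max_length=64)
```
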
Converted on 2023-10-13 using
```
LLama-2 -> removed <pad> token.
```

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to the original Hugging Face repo.

# Original description


<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>, <a href="https://github.com/jina-ai/finetuner"><b>Finetuner</b></a> team.</b>
</p>

## Intended Usage & Model Info

`jina-embedding-t-en-v1` is a tiny language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.

With a compact size of just 14 million parameters,
the model enables lightning-fast inference on CPU, while still delivering impressive performance.
Additionally, we provide the following options:

- [`jina-embedding-t-en-v1`](https://huggingface.co/jinaai/jina-embedding-t-en-v1): 14 million parameters **(you are here)**.
- [`jina-embedding-s-en-v1`](https://huggingface.co/jinaai/jina-embedding-s-en-v1): 35 million parameters.
- [`jina-embedding-b-en-v1`](https://huggingface.co/jinaai/jina-embedding-b-en-v1): 110 million parameters.
- [`jina-embedding-l-en-v1`](https://huggingface.co/jinaai/jina-embedding-l-en-v1): 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10 times bert-base (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30 times bert-base (soon).

## Data & Parameters

Please check out our [technical blog](https://arxiv.org/abs/2307.11224).

## Metrics

We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from SBERT and `text-embeddings-ada-002` from OpenAI:

| Name | Params | Dimension |
|------------------------|--------|-----------|
| all-minilm-l6-v2 | 23M | 384 |
| all-mpnet-base-v2 | 110M | 768 |
| ada-embedding-002 | Unknown (OpenAI API) | 1536 |
| jina-embedding-t-en-v1 | 14M | 312 |
| jina-embedding-s-en-v1 | 35M | 512 |
| jina-embedding-b-en-v1 | 110M | 768 |
| jina-embedding-l-en-v1 | 330M | 1024 |


| Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
|------------------------|-------|-------|-------|-------|-------|-------|----------|-------|---------|
| all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
| all-mpnet-base-v2 | 0.726 | **0.835** | 0.78 | 0.857 | 0.8 | **0.906** | 0.513 | 0.875 | 0.656 |
| ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | **0.86** | 0.903 | **0.685** | 0.876 | **0.726** |
| jina-embedding-t-en-v1 | 0.717 | 0.773 | 0.731 | 0.829 | 0.777 | 0.860 | 0.482 | 0.840 | 0.522 |
| jina-embedding-s-en-v1 | 0.743 | 0.786 | 0.738 | 0.837 | 0.80 | 0.875 | 0.523 | 0.857 | 0.524 |
| jina-embedding-b-en-v1 | **0.751** | 0.809 | 0.761 | 0.856 | 0.812 | 0.890 | 0.606 | 0.876 | 0.594 |
| jina-embedding-l-en-v1 | 0.745 | 0.832 | **0.781** | **0.869** | 0.837 | 0.902 | 0.573 | **0.881** | 0.598 |

## Inference Speed

We encoded a single sentence, "What is the current weather like today?", 10k times on:

1. CPU: MacBook Pro 2020, 2 GHz Quad-Core Intel Core i5
2. GPU: 1x Nvidia 3090

and recorded the time spent, to demonstrate the embedding speed:

| Name | Params | Dimension | Time @ CPU | Time @ GPU |
|------------------------|--------|-----------|------------|------------|
| jina-embedding-t-en-v1 | 14M | 312 | 5.78s | 2.36s |
| all-minilm-l6-v2 | 23M | 384 | 11.95s | 2.70s |
| jina-embedding-s-en-v1 | 35M | 512 | 17.25s | 2.81s |

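The numbers above are taken from the original card; a rough way to reproduce this kind of measurement (a sketch, not the authors' exact benchmark script) is to time 10k single-sentence encodes:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
sentence = "What is the current weather like today?"

start = time.perf_counter()
for _ in range(10_000):
    model.encode(sentence)  # one sentence per call, as in the table above
print(f"10k single-sentence encodes: {time.perf_counter() - start:.2f}s")
```
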
## Usage

Use with Jina AI Finetuner:

```python
# pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```

Use with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['how is the weather today', 'What is the current weather like today?']

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

## Fine-tuning

Please consider [Finetuner](https://github.com/jina-ai/finetuner).

## Plans

1. The development of `jina-embedding-s-en-v2` is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
2. We are currently working on a bilingual embedding model that combines English and X language. The upcoming model will be called `jina-embedding-s/b/l-de-v1`.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

```latex
@misc{günther2023jina,
      title={Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models},
      author={Michael Günther and Louis Milliken and Jonathan Geuter and Georgios Mastrapas and Bo Wang and Han Xiao},
      year={2023},
      eprint={2307.11224},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

config.json
ADDED
@@ -0,0 +1,23 @@
{
  "_name_or_path": "tmp/",
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "model_type": "bert",
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pre_trained": "",
  "structure": [],
  "type_vocab_size": 2,
  "vocab_size": 30522,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-12,
  "unk_token": "[UNK]"
}
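
This config describes a 4-layer BERT with hidden size 312 and intermediate size 1200. As an illustrative sanity check (a sketch added here, not part of the original upload), the parameter count implied by these values lands close to the 14 million quoted in the README:

```python
# Rough BERT parameter count from the config values above
# (biases and LayerNorms included, pooler excluded); an estimate, not an official figure.
vocab, hidden, inter, layers, max_pos, types = 30522, 312, 1200, 4, 512, 2

embeddings = (vocab + max_pos + types) * hidden + 2 * hidden  # word/pos/type embeddings + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V and attention output projections
    + 2 * hidden                     # attention LayerNorm
    + hidden * inter + inter         # feed-forward up-projection
    + inter * hidden + hidden        # feed-forward down-projection
    + 2 * hidden                     # output LayerNorm
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # roughly 14.3M
```
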
model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dffc2308cb6291951ed6025ca2e9234d1011913a5496294d7dc545ec999ba824
size 28703620
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
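
modules.json wires the model as a two-stage sentence-transformers pipeline: a Transformer encoder followed by a Pooling layer. A hedged sketch of the equivalent manual construction (standard sentence-transformers module classes assumed, max_seq_length taken from sentence_bert_config.json below):

```python
from sentence_transformers import SentenceTransformer, models

# Stage 0: the Transformer module; Stage 1: pooling over token embeddings (mean by default)
word_embedding = models.Transformer('jinaai/jina-embedding-t-en-v1', max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])
```
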
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,15 @@
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
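
The tokenizer is a lowercasing BertTokenizer (WordPiece). A quick way to try it, assuming the tokenizer files in this repo load via the standard transformers AutoTokenizer API (a sketch, not part of the original upload):

```python
from transformers import AutoTokenizer

# load the tokenizer described by the config above from this repo
tokenizer = AutoTokenizer.from_pretrained("michaelfeil/ct2fast-jina-embedding-t-en-v1")
print(tokenizer.tokenize("What is the current weather like today?"))
```
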
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff
vocabulary.json
ADDED
The diff for this file is too large to render.
See raw diff
vocabulary.txt
ADDED
The diff for this file is too large to render.
See raw diff