---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- arctic
- snowflake-arctic-embed
- transformers.js
license: apache-2.0
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ky
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- pa
- pl
- pt
- qu
- ro
- ru
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- yo
- zh
---
<h1 align="center">Snowflake's Arctic-embed-m-v2.0</h1>
<h4 align="center">
   <p>
       <a href="#news">News</a> |
       <a href="#models">Models</a> |
       <a href="#usage">Usage</a> |
       <a href="#evaluation">Evaluation</a> |
       <a href="#contact">Contact</a> |
       <a href="#faq">FAQ</a> |
       <a href="#license">License</a> |
       <a href="#acknowledgement">Acknowledgement</a>
   </p>
</h4>

## News
12/04/2024: Release of [snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) and [snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0), our newest models built with multilingual workloads in mind.

## Models
Snowflake arctic-embed-m-v2.0 is the newest addition to Snowflake's suite of embedding models, optimized for retrieval performance and inference efficiency.
Arctic Embed 2.0 sets a new standard for multilingual embedding models, delivering high-quality multilingual text retrieval without sacrificing performance in English.
Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.

Key Features:

1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.

2. Inference efficiency: With only 113M non-embedding parameters, inference is fast and efficient at any scale.

3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes/vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.

4. Long Context Support: arctic-embed-m-v2.0 builds on [GTE-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), which supports a context window of up to 8192 tokens via RoPE (see the encoding sketch after this list).
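
As a rough illustration of the long-context support, the sketch below encodes a multi-page document in a single pass with Sentence Transformers. The 8192-token figure comes from the model description above; the `max_seq_length` override and the toy document are assumptions for the example, not official usage guidance.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Snowflake/snowflake-arctic-embed-m-v2.0', trust_remote_code=True)
# Raise the truncation limit toward the model's 8192-token RoPE context window.
model.max_seq_length = 8192

# Hypothetical long document: repeat a sentence to emulate several pages of text.
long_document = ' '.join(['Snowflake builds data applications in the Data Cloud.'] * 2000)
embedding = model.encode(long_document)
print(embedding.shape)  # (768,)
```
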
### Quality Benchmarks
Unlike most other open-source models, Arctic-embed-m-v2.0 excels in both English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF) retrieval.
You no longer need multiple models to power high-quality English and multilingual retrieval. All numbers mentioned below are the average NDCG@10 across the datasets being discussed.

| Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **snowflake-arctic-m-v2.0** | 305M | 113M | 768 | **55.4** | 55.2 | **51.7** | **53.9** |
| snowflake-arctic-m | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
| me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
| bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | **56.8** | 40.8 | 41.3 |
| gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |

Aside from high-quality retrieval, arctic delivers embeddings that are easily compressible. By leveraging vector truncation via MRL, vector size decreases 3x with about 3% degradation in quality.
Combine MRLed vectors with vector compression (Int4) to power retrieval in 128 bytes per doc, as sketched in the example after the table below.

| Model | # dimensions | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
| snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
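
To make the 128-bytes-per-vector arithmetic concrete, the sketch below keeps the first 256 MRL dimensions, re-normalizes, and scalar-quantizes each dimension to 4 bits: 256 dimensions × 0.5 bytes = 128 bytes per vector. The symmetric int4 scheme here is an illustrative assumption, not necessarily the exact compression used for the numbers above.

```python
import numpy as np

def truncate_and_quantize(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    """Truncate an MRL-trained embedding to `dim` and scalar-quantize to int4."""
    # MRL training lets us keep a prefix of the vector; re-normalize so
    # cosine similarity still behaves.
    truncated = embedding[:dim]
    truncated = truncated / np.linalg.norm(truncated)
    # Symmetric scalar quantization into the int4 range [-8, 7] (illustrative).
    scale = np.abs(truncated).max() / 7.0
    return np.clip(np.round(truncated / scale), -8, 7).astype(np.int8)

full = np.random.randn(768).astype(np.float32)  # stand-in for a model embedding
codes = truncate_and_quantize(full)
print(codes.shape)  # (256,) -> 256 values * 4 bits = 128 bytes when packed
```
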
## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
model = SentenceTransformer(model_name, trust_remote_code=True)

# Define the queries and documents
queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Compute embeddings: use `prompt_name="query"` to encode queries!
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute cosine similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Output the results
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
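
If you want the 256-dimensional vectors from the truncation table above directly from Sentence Transformers, recent releases expose a `truncate_dim` argument at load time; treat its availability as an assumption about your installed version.

```python
from sentence_transformers import SentenceTransformer

# Assumes a sentence-transformers release that supports `truncate_dim`.
model = SentenceTransformer(
    'Snowflake/snowflake-arctic-embed-m-v2.0',
    trust_remote_code=True,
    truncate_dim=256,  # keep only the leading 256 MRL dimensions
)

# Re-normalize after truncation so cosine similarity stays meaningful.
embeddings = model.encode(['The Data Cloud!'], normalize_embeddings=True)
print(embeddings.shape)  # (1, 256)
```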

### Using Hugging Face Transformers

You can use the transformers package to use Snowflake's arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False, trust_remote_code=True)
model.eval()

query_prefix = 'Query: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)

# Compute token embeddings
with torch.no_grad():
    # Take the last hidden state of the CLS token as the text embedding
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings so the dot product below equals cosine similarity
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```

## Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project.
You can also email Daniel Campos (daniel.campos@snowflake.com).

## License

Arctic is licensed under the [Apache-2](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.