spacemanidol committed on
Commit 0fcceba · verified · 1 Parent(s): 601add0

Update README.md

Files changed (1): README.md +228 -3

README.md CHANGED
@@ -1,3 +1,228 @@
- ---
- license: apache-2.0
- ---
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- arctic
- snowflake-arctic-embed
- transformers.js
license: apache-2.0
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ky
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- pa
- pl
- pt
- qu
- ro
- ru
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- yo
- zh
---
<h1 align="center">Snowflake's Arctic-embed-m-v2.0</h1>
<h4 align="center">
   <p>
       <a href="#news">News</a> |
       <a href="#models">Models</a> |
       <a href="#usage">Usage</a> |
       <a href="#evaluation">Evaluation</a> |
       <a href="#contact">Contact</a> |
       <a href="#faq">FAQ</a> |
       <a href="#license">License</a> |
       <a href="#acknowledgement">Acknowledgement</a>
   </p>
</h4>


## News
12/04/2024: Release of [snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) and [snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0), our newest models designed with multilingual workloads in mind.


## Models
Snowflake arctic-embed-m-v2.0 is the newest addition to the suite of embedding models Snowflake has released, optimized for retrieval performance and inference efficiency.
Arctic Embed 2.0 sets a new standard for multilingual embedding models, delivering high-quality multilingual text retrieval without sacrificing performance in English.
Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.

Key Features:

1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.

2. Inference efficiency: With only 113M non-embedding parameters, inference is fast and efficient at any scale.

3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes per vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.

4. Long context support: arctic-embed-m-v2.0 builds on [GTE-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), which supports a context window of up to 8192 tokens via RoPE (see the sketch after this list).
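
To see these properties in practice, below is a minimal sanity-check sketch (assuming only the `sentence-transformers` package; the 8192-token window is inherited from the GTE backbone, and the snippet is illustrative rather than an official recipe):

```python
from sentence_transformers import SentenceTransformer

# Load the model (trust_remote_code is needed for the GTE-based architecture).
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)

# Long-context support: raise the sequence limit toward the 8192-token window.
model.max_seq_length = 8192

# The full-size vector is 768-dimensional before any MRL truncation.
embedding = model.encode("A single test sentence.")
print(embedding.shape)  # (768,)
```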


### Quality Benchmarks
Unlike most other open-source models, Arctic-embed-m-v2.0 excels at both English retrieval (via MTEB Retrieval) and multilingual retrieval (via MIRACL and CLEF).
You no longer need to maintain separate models for high-quality English and multilingual retrieval. All numbers below are the average NDCG@10 across the datasets in each benchmark.

| Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **snowflake-arctic-m-v2.0** | 305M | 113M | 768 | **55.4** | 55.2 | **51.7** | **53.9** |
| snowflake-arctic-m | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
| me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
| bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | **56.8** | 40.8 | 41.3 |
| gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |

Aside from high-quality retrieval, Arctic delivers embeddings that are easily compressible: vector truncation via MRL decreases vector size by 3x with only about 3% degradation in quality.
Combining MRL-truncated vectors with int4 vector compression powers retrieval at 128 bytes per document.

| Model | # dimensions | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (Focused) | Relative Performance | CLEF (Full) | Relative Performance |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
| snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
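
The arithmetic behind 128 bytes per document: 256 MRL dimensions at 4 bits each is exactly 128 bytes. The sketch below (a generic scalar quantizer for illustration; not necessarily the exact scheme used in Snowflake's evaluations) shows truncation plus int4 packing:

```python
import numpy as np

# Stand-in for a full-size 768-dim embedding from the model.
rng = np.random.default_rng(0)
vec = rng.standard_normal(768).astype(np.float32)

# MRL: keep the leading 256 dimensions, then re-normalize.
truncated = vec[:256].copy()
truncated /= np.linalg.norm(truncated)

# Symmetric scalar quantization: map each value to a 4-bit code in [0, 15].
scale = np.abs(truncated).max()
codes = np.round((truncated / scale) * 7 + 8).clip(0, 15).astype(np.uint8)

# Pack two 4-bit codes per byte -> 256 * 4 bits = 128 bytes total.
packed = (codes[0::2] << 4) | codes[1::2]
print(packed.nbytes)  # 128
```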

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
model = SentenceTransformer(model_name, trust_remote_code=True)

# Define the queries and documents
queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Compute embeddings: use `prompt_name="query"` to encode queries!
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute cosine similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Output the results
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
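
To pair this usage with the MRL truncation discussed in the benchmarks above, Sentence Transformers accepts a `truncate_dim` argument (available in sentence-transformers >= 2.7) that shortens embeddings at encode time. A minimal sketch:

```python
from sentence_transformers import SentenceTransformer

# Truncate embeddings to 256 dimensions at encode time (Matryoshka-style).
model = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-m-v2.0",
    trust_remote_code=True,
    truncate_dim=256,
)

query_embeddings = model.encode(["what is snowflake?"], prompt_name="query")
print(query_embeddings.shape)  # (1, 256)
```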

### Using Huggingface Transformers

You can use the transformers package with Snowflake's arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (on the query only).

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False, trust_remote_code=True)
model.eval()

query_prefix = 'Query: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)

# Compute token embeddings: take the CLS token (position 0) as the text embedding
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

# Score with cosine similarity and output ranked passages
scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```


## Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project.
You can also email Daniel Campos (daniel.campos@snowflake.com).


## License
Arctic is licensed under the [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.