Commit d60b0d4 (1 parent: 3a47ada), committed by Aamir
Update README.md
README.md CHANGED
@@ -1,7 +1,5 @@
|
|
1 |
---
|
2 |
license: apache-2.0
|
3 |
-
language:
|
4 |
-
- en
|
5 |
---
|
6 |
|
7 |
<br><br>
|
@@ -17,33 +15,23 @@ language:
|
|
17 |
# mxbai-colbert-v1
|
18 |
|
19 |
This is our first English ColBERT model, which is built upon our sentence embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).
|
20 |
-
You can learn more about the models in our [blog post](https://www.mixedbread.ai/blog/).
|
21 |
|
22 |
|
23 |
## Quickstart
|
24 |
|
25 |
-
|
26 |
|
27 |
-
`
|
28 |
|
29 |
-
Here, we provide several ways to use it.
|
30 |
|
31 |
-
### 1. Generate Embeddings
|
32 |
|
33 |
```python
|
34 |
-
from
|
35 |
-
from colbert.modeling.checkpoint import Checkpoint
|
36 |
-
from colbert.infra import Run, RunConfig, ColBERTConfig
|
37 |
|
38 |
-
#
|
39 |
-
|
40 |
|
41 |
-
# load mixedbread colbert
|
42 |
-
ckpt = Checkpoint("mixedbread-ai/mxbai-colbert-v1",
|
43 |
-
colbert_config=ColBERTConfig())
|
44 |
-
|
45 |
-
# encode query and documents
|
46 |
-
query = "Who wrote 'To Kill a Mockingbird'?"
|
47 |
documents = [
|
48 |
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
|
49 |
"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
|
@@ -52,63 +40,29 @@ documents = [
|
|
52 |
"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
|
53 |
"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
|
54 |
]
|
55 |
-
query_vectors = ckpt.queryFromText([query], bsize=16)
|
56 |
-
doc_vectors = ckpt.docFromText(documents, bsize=16)
|
57 |
-
```
|
58 |
|
59 |
-
|
60 |
-
|
61 |
-
1) Index
|
62 |
-
|
63 |
-
```python
|
64 |
-
from huggingface_hub import snapshot_download
|
65 |
-
from colbert import Indexer
|
66 |
-
from colbert.infra import Run, RunConfig, ColBERTConfig
|
67 |
|
68 |
-
#
|
69 |
-
|
70 |
-
|
71 |
-
|
72 |
-
gpu_count = 1
|
73 |
-
documents = [
|
74 |
-
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
|
75 |
-
"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
|
76 |
-
"Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
|
77 |
-
"Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
|
78 |
-
"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
|
79 |
-
"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
|
80 |
-
]
|
81 |
-
|
82 |
-
with Run().context(RunConfig(nranks=gpu_count, gpus=gpu_count, experiment='experiments')):
|
83 |
-
config = ColBERTConfig(
|
84 |
-
doc_maxlen=512
|
85 |
-
)
|
86 |
-
indexer = Indexer(
|
87 |
-
checkpoint="mixedbread-ai/mxbai-colbert-v1",
|
88 |
-
config=config,
|
89 |
-
)
|
90 |
-
indexer.index(name='demo', collection=documents)
|
91 |
-
|
92 |
```
|
93 |
|
94 |
-
|
95 |
|
96 |
-
```
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
|
102 |
-
|
103 |
-
|
104 |
-
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
config=config
|
109 |
-
)
|
110 |
-
query = "Who wrote 'To Kill a Mockingbird'?"
|
111 |
-
results = searcher.search(query, k=3)
|
112 |
```
|
113 |
|
114 |
## Using API
|
@@ -119,40 +73,42 @@ You’ll be able to use the models through our API as well. The API is coming so
|
|
119 |
|
120 |
### 1. Reranking Performance
|
121 |
|
122 |
-
**Setup:** we use BM25 as the first-stage retrieval model, and then use ColBERT for reranking. Following common practice, we report NDCG@10 as the metric.
|
123 |
|
124 |
Here, we compare our model with two widely used ColBERT models, as follows:
|
125 |
|
126 |
|
127 |
-
|
|
128 |
-
|
|
129 |
-
|
|
130 |
-
|
|
131 |
-
|
|
132 |
-
|
|
133 |
-
|
|
134 |
-
|
|
135 |
-
|
|
136 |
-
|
|
137 |
-
|
|
138 |
-
|
|
139 |
-
|
|
140 |
-
|
|
141 |
-
|
142 |
-
|
|
|
|
|
143 |
|
144 |
### 2. Retrieval Performance
|
145 |
|
146 |
-
ColBERT is mainly used for reranking. Here, we also test our model's performance on retrieval tasks.
|
147 |
|
148 |
Due to resource limitations, we only test our model on three BEIR tasks. NDCG@10 serves as the main metric.
|
149 |
|
150 |
|
151 |
-
| Model
|
152 |
-
|
|
153 |
-
|
|
154 |
-
|
|
155 |
-
|
|
156 |
|
157 |
Although our ColBERT also performs well on retrieval, we recommend using our embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) for retrieval.
|
158 |
|
@@ -162,4 +118,4 @@ Please join our [Discord Community](https://discord.gg/jDfMHzAVfU) and share you
|
|
162 |
|
163 |
|
164 |
## License
|
165 |
-
Apache 2.0
|
|
|
1 |
---
|
2 |
license: apache-2.0
|
|
|
|
|
3 |
---
|
4 |
|
5 |
<br><br>
|
|
|
15 |
# mxbai-colbert-v1
|
16 |
|
17 |
This is our first English ColBERT model, which is built upon our sentence embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).
|
18 |
+
You can learn more about the models in our [blog post](https://www.mixedbread.ai/blog/mxbai-colbert-large-v1).
|
19 |
|
20 |
|
21 |
## Quickstart
|
22 |
|
23 |
+
We recommend using [RAGatouille](https://github.com/bclavie/RAGatouille) to run our ColBERT model.
|
24 |
|
25 |
+
`pip install ragatouille`
|
26 |
|
|
|
27 |
|
|
|
28 |
|
29 |
```python
|
30 |
+
from ragatouille import RAGPretrainedModel
|
|
|
|
|
31 |
|
32 |
+
# Let's create a ragatouille instance
|
33 |
+
RAG = RAGPretrainedModel.from_pretrained("mixedbread-ai/mxbai-colbert-v1")
|
34 |
|
35 |
documents = [
|
36 |
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
|
37 |
"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
|
|
|
40 |
"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
|
41 |
"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
|
42 |
]
|
43 |
|
44 |
+
# index documents
|
45 |
+
RAG.index(documents, index_name="mockingbird")
|
46 |
|
47 |
+
# search
|
48 |
+
query = "Who wrote 'To Kill a Mockingbird'?"
|
49 |
+
results = RAG.search(query)
|
50 |
```
|
51 |
|
52 |
+
The result looks like this:
|
53 |
|
54 |
+
```
|
55 |
+
[{'content': "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
|
56 |
+
'score': 28.453125,
|
57 |
+
'rank': 1,
|
58 |
+
'document_id': '9d564e82-f14f-433a-ab40-b10bda9dc370',
|
59 |
+
'passage_id': 0},
|
60 |
+
{'content': "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
|
61 |
+
'score': 27.03125,
|
62 |
+
'rank': 2,
|
63 |
+
'document_id': 'a35a89c3-b610-4e2e-863e-fa1e7e0710a6',
|
64 |
+
'passage_id': 2},
|
65 |
+
...]
|
66 |
```
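
The `score` values in the output above are relevance scores. ColBERT-style models are generally described as scoring a query against a document by late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The snippet below is a minimal sketch of that idea using hypothetical two-dimensional toy vectors and plain Python; it is not RAGatouille's implementation, and the helper names `cosine` and `maxsim_score` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical per-token embeddings (real models use far more dimensions).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

Because every query token contributes its own maximum, a document that covers each query token well ranks above one that matches all of them only loosely.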
|
67 |
|
68 |
## Using API
|
|
|
73 |
|
74 |
### 1. Reranking Performance
|
75 |
|
76 |
+
**Setup:** we use BM25 as the first-stage retrieval model, and then use ColBERT for reranking. We evaluate the out-of-domain performance on 13 public BEIR datasets. Following common practice, we report NDCG@10 as the metric.
|
77 |
|
78 |
Here, we compare our model with two widely used ColBERT models, as follows:
|
79 |
|
80 |
|
81 |
+
| Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
|
82 |
+
| ---------------- | --------: | --------------: | ---------------------: |
|
83 |
+
| ArguAna | 29.99 | **33.42** | 33.11 |
|
84 |
+
| ClimateFEVER | 16.51 | 20.66 | **20.85** |
|
85 |
+
| DBPedia | 31.80 | **42.16** | 40.61 |
|
86 |
+
| FEVER | 65.13 | **81.07** | 80.75 |
|
87 |
+
| FiQA | 23.61 | 35.60 | **35.86** |
|
88 |
+
| HotPotQA | 63.30 | **68.84** | 67.62 |
|
89 |
+
| NFCorpus | 33.75 | **36.69** | 36.37 |
|
90 |
+
| NQ | 30.55 | 51.27 | **51.43** |
|
91 |
+
| Quora | 78.86 | 85.18 | **86.95** |
|
92 |
+
| SCIDOCS | 14.90 | 15.39 | **16.98** |
|
93 |
+
| SciFact | 67.89 | 70.20 | **71.48** |
|
94 |
+
| TREC-COVID | 59.47 | 75.00 | **81.04** |
|
95 |
+
| Webis-touché2020 | **44.22** | 32.12 | 31.70 |
|
96 |
+
| Average | 43.08 | 49.82 | **50.37** |
|
97 |
+
|
98 |
+
Find more in our [blog post](https://www.mixedbread.ai/blog/mxbai-rerank-v1) and in this [spreadsheet](https://docs.google.com/spreadsheets/d/1ZT_KN40PnHQa21hTdrk4_9GCnqm916lJJz3W83mo1og/edit?usp=sharing).
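
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 results against the best possible ordering. The snippet below is a minimal sketch with linear gain, assuming the ranked list already contains every judged document for the query (a full evaluation computes the ideal ordering from all relevance judgments). The function names and the relevance labels are hypothetical, and this is not the evaluation code used to produce the table.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels, listed in ranked order (1 = relevant, 0 = not relevant).
perfect = [1, 1, 0, 0]   # both relevant documents ranked first
shifted = [0, 1, 1, 0]   # relevant documents pushed down one position

ndcg_perfect = ndcg_at_k(perfect)
ndcg_shifted = ndcg_at_k(shifted)
print(ndcg_perfect, round(ndcg_shifted, 4))
```

The scores in the table are such per-query values averaged over each dataset and multiplied by 100.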
|
99 |
|
100 |
### 2. Retrieval Performance
|
101 |
|
102 |
+
ColBERT is mainly used for reranking. Here, we also test our model's performance on retrieval tasks on a subset of the BEIR benchmarks.
|
103 |
|
104 |
Due to resource limitations, we only test our model on three BEIR tasks. NDCG@10 serves as the main metric.
|
105 |
|
106 |
|
107 |
+
| Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
|
108 |
+
| ---------- | --------: | --------------: | ---------------------: |
|
109 |
+
| NFCorpus | 33.7 | 33.8 | **36.5** |
|
110 |
+
| SciFact | 68.9 | 70.1 | **71.3** |
|
111 |
+
| TREC-COVID | 72.6 | 75.0 | **80.5** |
|
112 |
|
113 |
Although our ColBERT also performs well on retrieval, we recommend using our embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) for retrieval.
|
114 |
|
|
|
118 |
|
119 |
|
120 |
## License
|
121 |
+
Apache 2.0
|