license: apache-2.0
language:
  - en



The crispy rerank family from mixedbread ai.

mxbai-colbert-v1

This is our first English ColBERT model, which is built upon our sentence embedding model mixedbread-ai/mxbai-embed-large-v1. You can learn more about the models in our blog post.

Quickstart

Currently, the best way to use it is with the official ColBERT library.

python -m pip install -U "colbert-ai[faiss-gpu]"

Here, we provide several ways to use it.

1. Generate Embeddings

from huggingface_hub import snapshot_download
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Make sure all model files are cached locally
snapshot_download(repo_id="mixedbread-ai/mxbai-colbert-v1")

# load mixedbread colbert
ckpt = Checkpoint("mixedbread-ai/mxbai-colbert-v1",
                  colbert_config=ColBERTConfig())

# encode query and documents
query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]
query_vectors = ckpt.queryFromText([query], bsize=16)
doc_vectors = ckpt.docFromText(documents, bsize=16)
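
ColBERT compares these token-level embeddings with a late-interaction (MaxSim) score: for each query token, take the highest similarity over all document tokens, then sum over the query tokens. Below is a minimal, dependency-free sketch of that scoring rule on plain Python lists. The helper name `maxsim_score` and the toy vectors are made up for illustration. In practice the same operation would be applied to the tensors in `query_vectors` and `doc_vectors`, whose shapes and padding handling are not covered here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product among document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (made-up values, for illustration only).
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "doc_b": [[0.5, 0.5], [0.0, 0.5]],
}

# Rank the toy documents by descending MaxSim score.
ranking = sorted(
    toy_docs,
    key=lambda name: maxsim_score(toy_query, toy_docs[name]),
    reverse=True,
)
print(ranking)
```

Because every query token is matched independently, a document scores well only if it contains a good match for each part of the query.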

2. Index & Search

  1. Index
from huggingface_hub import snapshot_download
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

# Make sure all model files are cached locally
snapshot_download(repo_id="mixedbread-ai/mxbai-colbert-v1")


gpu_count = 1
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]

with Run().context(RunConfig(nranks=gpu_count, gpus=gpu_count, experiment='experiments')):
    config = ColBERTConfig(
      doc_maxlen=512
    )
    indexer = Indexer(
      checkpoint="mixedbread-ai/mxbai-colbert-v1",
      config=config,
    )
    indexer.index(name='demo', collection=documents)

  2. Search
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

gpu_count = 1

with Run().context(RunConfig(nranks=gpu_count, experiment='experiments')):
    config = ColBERTConfig(
      query_maxlen=128
    )
    searcher = Searcher(
      index='demo', 
      config=config
    )
    query = "Who wrote 'To Kill a Mockingbird'?"
    results = searcher.search(query, k=3)
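
To inspect the hits, it helps to map them back to the indexed texts. The sketch below assumes that `searcher.search` returns three parallel lists (passage ids, ranks, scores) and that the passage ids are positions in the `documents` list passed to the indexer. Please verify both against your installed `colbert-ai` version. `format_results` is a hypothetical helper, and the example ids and scores are invented.

```python
from typing import List, Sequence, Tuple


def format_results(
    results: Tuple[Sequence[int], Sequence[int], Sequence[float]],
    collection: Sequence[str],
    max_chars: int = 60,
) -> List[str]:
    """Turn (passage_ids, ranks, scores) into one readable line per hit."""
    passage_ids, ranks, scores = results
    lines = []
    for pid, rank, score in zip(passage_ids, ranks, scores):
        snippet = collection[pid][:max_chars]  # truncate long passages
        lines.append(f"[{rank}] score={score:.1f} id={pid} {snippet}")
    return lines


# Invented values for illustration; real ids and scores depend on the index.
example_collection = ["first passage", "second passage", "third passage"]
example_results = ([2, 0], [1, 2], [27.5, 19.0])

for line in format_results(example_results, example_collection):
    print(line)
```

With the snippets above, this would be called as `format_results(results, documents)`.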

Using API

You’ll be able to use the models through our API as well. The API is coming soon and will have some exciting features. Stay tuned!

Evaluation

1. Reranking Performance

Setup: we use BM25 as the first-stage retriever and then rerank its results with ColBERT. Following common practice, we report NDCG@10 as the metric.

Here, we compare our model with two widely used ColBERT models, as follows:

| Dataset | ColBERTv2 | Jina-ColBERT-V1 | Mxbai-ColBERT-V1 |
| --- | --- | --- | --- |
| dbpedia-entity | 31.8 | 42.2 | 40.6 |
| fiqa | 23.6 | 35.6 | 35.9 |
| nfcorpus | 33.8 | 36.7 | 36.4 |
| nq | 30.6 | 51.3 | 51.4 |
| scidocs | 14.9 | 15.4 | 17.0 |
| scifact | 67.9 | 70.2 | 71.5 |
| trec-covid | 59.5 | 75.0 | 81.0 |
| webis-touche2020 | 44.2 | 32.1 | 31.7 |
| signal1m | 33.2 | 30.9 | 33.1 |
| trec-news | 46.0 | 45.2 | 47.1 |
| robust04 | 47.5 | 47.7 | 47.5 |
| avg | 39.4 | 43.8 | 44.8 |

Find more in our blog post and on this spreadsheet.
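
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query, using the linear-gain form (relevance divided by log2 of rank + 1). It is not the evaluation code behind the numbers above, and evaluation toolkits can differ in details such as the gain function. The function names and the toy relevance labels are illustrative assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain of the top-k labels, given in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in ranked order.
    judged_relevances: labels of all judged documents for the query.
    """
    ideal = sorted(judged_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy example: two relevant documents, one of them ranked third instead of second.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
print(round(score, 4))
```

The reported figure for a dataset is the mean of this per-query value over all queries, expressed as a percentage.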

2. Retrieval Performance

ColBERT is mainly used for reranking. Here, we also test our model's performance on retrieval tasks.

Due to resource limitations, we only test our model on three BEIR tasks. NDCG@10 serves as the main metric.

| Dataset | ColBERTv2 | Jina-ColBERT-V1 | Mxbai-ColBERT-V1 |
| --- | --- | --- | --- |
| scifact | 68.9 | 70.1 | 71.3 |
| nfcorpus | 33.7 | 33.8 | 36.5 |
| trec-covid | 72.6 | 75.0 | 80.5 |

Although our ColBERT also performs well on retrieval, we recommend using our embedding model mixedbread-ai/mxbai-embed-large-v1 for retrieval.

Community

Please join our Discord Community and share your feedback and thoughts! We are here to help and always happy to chat.

License

Apache 2.0