---
base_model: answerdotai/ModernBERT-base
datasets:
- lightonai/ms-marco-en-bge
language:
- en
library_name: PyLate
pipeline_tag: sentence-similarity
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:808728
- loss:Distillation
---
# PyLate model based on answerdotai/ModernBERT-base
This is a [PyLate](https://github.com/lightonai/pylate) model fine-tuned from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the [train](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors, one per token, and can be used for semantic textual similarity using the MaxSim operator.
## Model Details
### Model Description
- **Model Type:** PyLate model
- **Base model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Document Length:** 180 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
- **Training Dataset:**
  - [train](https://huggingface.co/datasets/lightonai/ms-marco-en-bge)
- **Language:** en
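The MaxSim similarity listed above can be illustrated with a minimal pure-Python sketch. It assumes dot-product similarity between token vectors and uses toy 2-dimensional embeddings; the helper names `dot` and `maxsim` are illustrative and not part of PyLate.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """For each query token, take its best match among the document tokens, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (the model itself outputs 128 dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim(query, doc_a)  # each query token finds an exact match
score_b = maxsim(query, doc_b)  # each query token only finds a partial match
print(score_a, score_b)
```

Because every query token contributes its own best match, a document is rewarded for covering all parts of the query rather than for matching a single pooled vector.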
### Model Sources
- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
### Full Model Architecture
```
ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
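To make the shapes in this architecture concrete, the following back-of-the-envelope sketch computes the size of the projection layer and of the multi-vector embeddings. The token limits come from the model description above; the 4 bytes per value is an assumption (float32 storage), and actual inputs may produce fewer tokens than the limits.

```python
hidden_size = 768       # 'in_features' of the Dense layer
embedding_dim = 128     # 'out_features' of the Dense layer
max_doc_tokens = 180    # document length from the model description
max_query_tokens = 32   # query length from the model description

# The Dense layer has no bias, so it is a single weight matrix.
projection_weights = hidden_size * embedding_dim

# One 128-dimensional vector is produced per token.
doc_values = max_doc_tokens * embedding_dim
query_values = max_query_tokens * embedding_dim

bytes_per_value = 4  # assumption: embeddings stored as float32
doc_bytes = doc_values * bytes_per_value

print(projection_weights, doc_values, query_values, doc_bytes)
```

Under these assumptions, a full-length document occupies about 90 KiB, which is why multi-vector indexes are noticeably larger than single-vector ones.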
## Usage
First install the PyLate library:
```bash
pip install -U pylate
```
### Retrieval
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. Retrieval is backed by a Voyager HNSW index, which stores the document embeddings and enables fast approximate nearest-neighbor search.
#### Indexing documents
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
# `pylate_model_id` is a placeholder: set it to this model's Hugging Face repository id or a local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:
```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
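The sketch below shows one way to consume the retrieval output. It assumes that `retriever.retrieve` returns one list of hits per query, each hit being a dictionary with `"id"` and `"score"` keys; the result here is mocked with made-up values so the snippet runs without a model, and `best_hits` and `id_to_text` are hypothetical names.

```python
# Mock result in the assumed shape: one list of hits per query (values are made up).
scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 7.5}],
    [{"id": "1", "score": 10.9}, {"id": "2", "score": 6.1}],
]
queries = ["query for document 3", "query for document 1"]
id_to_text = {"1": "document 1 text", "2": "document 2 text", "3": "document 3 text"}


def best_hits(queries, scores, id_to_text):
    """Return (query, best document id, best document text) for each query with hits."""
    rows = []
    for query, hits in zip(queries, scores):
        if hits:
            top = max(hits, key=lambda hit: hit["score"])
            rows.append((query, top["id"], id_to_text[top["id"]]))
    return rows


best = best_hits(queries, scores, id_to_text)
for query, doc_id, text in best:
    print(f"{query!r} -> {doc_id}: {text}")
```

The index stores only ids and embeddings, so the mapping from ids back to document text has to be kept on the application side.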
### Reranking
If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
```python
from pylate import models, rank

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned position by position with `documents`
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# `pylate_model_id` is a placeholder: set it to this model's Hugging Face repository id or a local path
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
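Conceptually, reranking scores each candidate against its query with MaxSim and sorts the candidates by that score. The following pure-Python sketch illustrates the idea on toy 2-dimensional embeddings; it is not PyLate's implementation, and `rerank_sketch` and its output shape are assumptions made for illustration.

```python
def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
        for q in query_tokens
    )


def rerank_sketch(documents_ids, queries_embeddings, documents_embeddings):
    """Sort each query's candidates by descending MaxSim score."""
    results = []
    for ids, query, candidates in zip(documents_ids, queries_embeddings, documents_embeddings):
        scored = [
            {"id": doc_id, "score": maxsim(query, doc)}
            for doc_id, doc in zip(ids, candidates)
        ]
        results.append(sorted(scored, key=lambda hit: hit["score"], reverse=True))
    return results


# Toy data: one query with two candidates, each a list of 2-dimensional token vectors.
documents_ids = [[1, 2]]
queries_embeddings = [[[1.0, 0.0], [0.0, 1.0]]]
documents_embeddings = [[[[0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]]]]

reranked = rerank_sketch(documents_ids, queries_embeddings, documents_embeddings)
print(reranked)
```

Note how the nesting mirrors the inputs above: the outer list runs over queries, the inner list over that query's candidates.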
## Training Details
### Training Dataset
#### train
* Dataset: [train](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) at [11e6ffa](https://huggingface.co/datasets/lightonai/ms-marco-en-bge/tree/11e6ffa1d22f461579f451eb31bdc964244cb61f)
* Size: 808,728 training samples
* Columns: `query_id`, `document_ids`, and `scores`
* Approximate statistics based on the first 1000 samples:
| | query_id | document_ids | scores |
|:--------|:---------|:-------------|:-------|
| type | string | list | list |
| details | | | |
* Samples:
| query_id | document_ids | scores |
|:---------|:-------------|:-------|
| 121352 | ['2259784', '4923159', '40211', '1545154', '8527175', ...] | [0.2343463897705078, 0.639204204082489, 0.3806908428668976, 0.5623092651367188, 0.8051995635032654, ...] |
| 634306 | ['7723525', '1874779', '379307', '2738583', '7599583', ...] | [0.7124203443527222, 0.7379189729690552, 0.5786551237106323, 0.6142299175262451, 0.6755089163780212, ...] |
| 920825 | ['5976297', '2866112', '3560294', '3285659', '4706740', ...] | [0.6462352871894836, 0.7880821228027344, 0.791019856929779, 0.7709633111953735, 0.8284491300582886, ...] |
* Loss: pylate.losses.distillation.Distillation
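Each training sample pairs a query with a list of documents and the teacher scores for those documents, and the student is trained to reproduce the teacher's ranking. A common formulation of such score distillation is the KL divergence between the softmax distributions of teacher and student scores, sketched below in pure Python. This is an illustration under that assumption; the exact normalization and reduction used by `pylate.losses.distillation.Distillation` may differ, and the student scores are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate documents of one query."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Teacher scores rounded from the first sample above; student scores are hypothetical.
teacher = [0.234, 0.639, 0.381, 0.562, 0.805]
student = [0.9, 0.1, 0.4, 0.3, 0.2]

loss_mismatch = kl_distillation(teacher, student)
loss_match = kl_distillation(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero when the student assigns the same relative scores as the teacher and grows as the two distributions diverge.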
### Evaluation Results
nDCG@10 scores for multi-vector retrieval models
| Model | SciFact | NFCorpus | FiQA | TREC-Covid |
| --------------------------- | --------- | -------- | --------- | ---------- |
| BERT | 71.5 | 34.2 | 35.0 | 69.9 |
| ModernBERT-Base (in paper) | 73.0 | **35.2** | 38.0 | **80.5** |
| ModernBERT-Base (this repo) | **73.88** | 34.96 | **39.47** | 79.36 |
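For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k for a single query. It assumes linear gain and, as a simplification, computes the ideal ranking from the retrieved list only, whereas standard evaluation tools build it from all judged documents for the query. The table reports the metric averaged over queries and multiplied by 100.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg([1, 0, 0])  # the relevant document is ranked first
second = ndcg([0, 1])      # the relevant document is ranked second
print(perfect, second)
```

The logarithmic discount means a relevant document at rank 2 already costs more than a third of the score in this single-relevant-document case.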
### Training Hyperparameters
#### Non-Default Hyperparameters
- `per_device_train_batch_size`: 16
- `learning_rate`: 8e-05
- `num_train_epochs`: 1
- `warmup_ratio`: 0.05
- `bf16`: True
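As a rough illustration of what these values imply, the sketch below estimates the number of optimizer steps and warmup steps for one epoch. The number of devices is not stated in this card, so `num_devices` is a hypothetical parameter; the sketch also assumes no gradient accumulation and that the last partial batch is kept.

```python
import math

dataset_size = 808_728            # training samples, from the dataset section
per_device_train_batch_size = 16
num_train_epochs = 1
warmup_ratio = 0.05


def training_steps(num_devices):
    """Estimate (total steps, warmup steps); num_devices is a hypothetical input."""
    global_batch = per_device_train_batch_size * num_devices
    steps_per_epoch = math.ceil(dataset_size / global_batch)
    total = steps_per_epoch * num_train_epochs
    warmup = math.ceil(total * warmup_ratio)
    return total, warmup


single = training_steps(num_devices=1)
eight = training_steps(num_devices=8)
print(single, eight)
```

The learning rate of 8e-05 would be reached at the end of the warmup phase, after which the scheduler takes over.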
#### All Hyperparameters