---
license: bigscience-bloom-rail-1.0
datasets:
- cmarkea/mmarco-contrastive
language:
- fr
- en
pipeline_tag: feature-extraction
base_model:
- cmarkea/bloomz-3b-dpo-chat
---
# Bloomz-3b-retriever-v2
## Presentation
We introduce the Bloomz-3b-retriever-v2 model, based on the [Bloomz-3b-dpo-chat](https://huggingface.co/ArkeaIAF/bloomz-3b-dpo-chat) model. This bi-encoder projects queries and articles into the same vector space so that a query lands close to the articles related to it. The model is language-agnostic for French and English: a query in either language will be close to a relevant article whether that article is written in French or English. It is well suited to Open Domain Question Answering (ODQA) and can be complemented by the rerankers [Bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) or [Bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking).
## Training
The training dataset is a variant of [mMARCO](https://huggingface.co/datasets/cmarkea/mmarco-contrastive) designed for contrastive learning, with false negatives filtered out. The filtering threshold was set to 0.8, and each positive observation is contrasted with its 10 hardest negatives (ordered by decreasing score). The model was trained on a uniform distribution of language pairs (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English). The learning objective is an InfoNCE loss with a trainable temperature parameter, as introduced for the [CLIP](https://arxiv.org/abs/2103.00020) model.
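To make this objective concrete, here is a minimal sketch of a CLIP-style InfoNCE loss with a trainable temperature, under the assumption that each query embedding is paired with one positive article embedding followed by its hard negatives; the class name and tensor shapes are illustrative and do not come from the actual training code.
```python
import torch
import torch.nn.functional as F

class InfoNCE(torch.nn.Module):
    """Sketch of an InfoNCE objective with a learnable temperature (CLIP-style)."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Learn log(1/T) so the temperature stays strictly positive.
        self.log_inv_temp = torch.nn.Parameter(
            torch.log(torch.tensor(1.0 / init_temperature))
        )

    def forward(self, queries: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # queries: (B, D); candidates: (B, 1 + K, D) with the positive at index 0.
        q = F.normalize(queries, dim=-1)
        c = F.normalize(candidates, dim=-1)
        # Cosine similarity of each query with its own candidate set: (B, 1 + K).
        logits = torch.einsum('bd,bkd->bk', q, c) * self.log_inv_temp.exp()
        # Cross-entropy against index 0 pushes the positive above the hard negatives.
        target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, target)
```
Learning the log of the inverse temperature keeps the temperature positive while letting its scale adapt freely during training.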
## Note
Unlike the [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever), this much more efficient model uses cosine distance as its metric (instead of L2 distance as previously).
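As a practical aside (a general remark about deployment, not something prescribed by this card): if the embeddings are stored in a vector index that only supports inner product, L2-normalizing them first makes the inner product equal to the cosine similarity, i.e. 1 minus the cosine distance used by this model.
```python
import numpy as np

# L2-normalize embeddings so that a plain dot product equals cosine similarity
# (cosine distance = 1 - similarity); emb is any (N, D) array of embeddings.
def l2_normalize(emb: np.ndarray) -> np.ndarray:
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)
```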
## Benchmark
The performance evaluation is based on the evaluation portion of SQuAD (5,921 queries over 1,204 articles across 35 different topics). One interesting aspect of this dataset is that several articles are associated with each theme, creating challenging contexts in which a query may be close to several relevant articles. On average, there are about thirty articles per theme (see [Bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) for the exact distribution).
We compare performance using the average rank of the article targeted by each query (Top-mean), the standard deviation of these ranks (Top-std), the percentage of queries whose target article appears in the Top-1, Top-5, and Top-10 results, and finally the mean reciprocal rank (MRR) across the 1,204 articles.
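For reference, every metric reported below can be derived from the rank at which the targeted article is retrieved for each query. The sketch below is an illustrative computation assuming a `ranks` array holding the 1-based rank of the gold article for every query; it is not the evaluation script behind the tables.
```python
import numpy as np

def retrieval_metrics(ranks: np.ndarray) -> dict:
    # ranks: 1-based rank of the targeted article for each query.
    ranks = np.asarray(ranks, dtype=float)
    return {
        'Top-mean': ranks.mean(),
        'Top-std': ranks.std(),
        'Top-1 (%)': 100 * (ranks <= 1).mean(),
        'Top-5 (%)': 100 * (ranks <= 5).mean(),
        'Top-10 (%)': 100 * (ranks <= 10).mean(),
        'MRR (%)': 100 * (1.0 / ranks).mean(),
    }
```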
| Model (FR/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) | MRR (%) |
|----------------------------------------------------------------------------------------------------:|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:|
| BM25 | 16.8 | 100.8 | 71.7 | 88.3 | 91.8 | 79.2 |
| [CamemBERT](https://huggingface.co/camembert/camembert-base) | 269.6 | 303.0 | 5.6 | 12.5 | 16.5 | 9.7 |
| [STS-CamemBERT](https://huggingface.co/h4c5/sts-camembert-base) | 23.1 | 85.5 | 36.0 | 63.0 | 74.0 | 48.5 |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 10.2 | 40.1 | 43.9 | 73.9 | 84.0 | 57.3 |
| [E5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 6.1 | 29.7 | 59.9 | 84.9 | 91.0 | 71.1 |
| [E5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 5.2 | 29.2 | 67.0 | 89.2 | 93.7 | 76.7 |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever) | 10.2 | 46.6 | 51.5 | 78.1 | 86.2 | 63.5 |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever) | 8.8 | 36.4 | 49.2 | 77.5 | 86.1 | 62.0 |
| [Bloomz-560m-retriever-v2](https://huggingface.co/cmarkea/bloomz-560m-retriever-v2) | 4.0 | 17.1 | 68.0 | 89.9 | 94.4 | 77.7 |
| [Bloomz-3b-retriever-v2](https://huggingface.co/cmarkea/bloomz-3b-retriever-v2) | 2.8 | 14.8 | 76.5 | 94.4 | 97.2 | 84.4 |

| Model (EN/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) | MRR (%) |
|----------------------------------------------------------------------------------------------------:|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:|
| BM25 | 280.7 | 371.8 | 23.9 | 37.4 | 43.3 | 30.4 |
| [CamemBERT](https://huggingface.co/camembert/camembert-base) | 355.0 | 328.3 | 0.9 | 3.7 | 6.4 | 3.13 |
| [STS-CamemBERT](https://huggingface.co/h4c5/sts-camembert-base) | 102.2 | 196.9 | 13.1 | 30.5 | 40.7 | 22.1 |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 10.6 | 41.2 | 43.3 | 72.4 | 82.7 | 56.5 |
| [E5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 9.9 | 38.1 | 49.8 | 77.2 | 85.4 | 62.6 |
| [E5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 5.6 | 26.9 | 62.9 | 86.9 | 92.5 | 73.8 |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever) | 11.0 | 47.8 | 48.3 | 75.7 | 84.7 | 60.4 |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever) | 8.9 | 37.6 | 48.8 | 77.4 | 86.1 | 61.6 |
| [Bloomz-560m-retriever-v2](https://huggingface.co/cmarkea/bloomz-560m-retriever-v2) | 4.4 | 18.9 | 66.6 | 89.3 | 94.1 | 76.6 |
| [Bloomz-3b-retriever-v2](https://huggingface.co/cmarkea/bloomz-3b-retriever-v2) | 2.7 | 14.2 | 75.7 | 94.5 | 97.1 | 83.9 |
## How to Use Bloomz-3b-retriever-v2
**With Transformers API:**
```python
from typing import Union, List

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cdist

tokenizer = AutoTokenizer.from_pretrained('cmarkea/bloomz-3b-retriever-v2')
model = AutoModel.from_pretrained('cmarkea/bloomz-3b-retriever-v2')

def infer(txt: Union[str, List[str]]) -> np.ndarray:
    tok = tokenizer(txt, padding=True, return_tensors='pt')
    with torch.no_grad():
        embedding = model(**tok)
    # Important: take only the last token! (With padded batches, this relies on
    # the tokenizer padding on the left so the last position is a real token.)
    return embedding.get('last_hidden_state')[:, -1, :].numpy()

list_of_contexts: List[str] = [...]
emb_contexts = infer(list_of_contexts)
list_of_queries: List[str] = [...]
emb_queries = infer(list_of_queries)

# Important: use cosine distance!
dist = cdist(emb_queries, emb_contexts, 'cosine')
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]

# Top-5 nearest contexts for each query
top_contexts = top_k(5)
```
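For larger corpora, the embedding step can be chunked to bound memory usage; the following is a minimal sketch reusing the `infer` function above (the `infer_batched` helper and its batch size are illustrative, not part of the model card).
```python
def infer_batched(texts: List[str], batch_size: int = 16) -> np.ndarray:
    # Embed texts in chunks with the infer() function defined above.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    return np.concatenate([infer(batch) for batch in batches], axis=0)
```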
**With Pipeline API:**
```python
from typing import List

import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever-v2')

# Important: take only the last token of each sequence!
infer = lambda x: [ii[0][-1] for ii in retriever(x)]

list_of_contexts: List[str] = [...]
emb_contexts = np.stack(infer(list_of_contexts), axis=0)
list_of_queries: List[str] = [...]
emb_queries = np.stack(infer(list_of_queries), axis=0)

# Important: use cosine distance!
dist = cdist(emb_queries, emb_contexts, 'cosine')
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]

# Top-5 nearest contexts for each query
top_contexts = top_k(5)
```
## Citation
```bibtex
@online{DeBloomzRetv2,
AUTHOR = {Cyrile Delestre},
ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
URL = {https://huggingface.co/cmarkea/bloomz-3b-retriever-v2},
YEAR = {2024},
KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```