BAAI/bge-m3-retromae

@@ -4,9 +4,11 @@ tags:
 license: mit
 ---
 For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
-# BGE-M3
 In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
 - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
 - Multi-Linguality: It can support more than 100 working languages.
@@ -23,12 +25,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 ## News:
 - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
 ## Specs
 - Model
 | Model Name |  Dimension | Sequence Length | Introduction |
 |:----:|:---:|:---:|:---:|
 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
@@ -45,7 +49,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
 ## FAQ
 **1. Introduction for different retrieval methods**
@@ -54,7 +57,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
 - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
 **2. Comparison with BGE-v1.5 and other monolingual models**
 BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
@@ -74,6 +76,11 @@ For sparse retrieval methods, most open-source libraries currently do not suppor
 Contributions from the community are welcome.
 **4. How to fine-tune the BGE-M3 model?**
 You can follow the common procedure in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
@@ -215,10 +222,10 @@ print(model.compute_score(sentence_pairs,
 - Long Document Retrieval
   - MLDR:
   ![avatar](./imgs/long.jpg)
-  Please note that MLDR is a document retrieval dataset we constructed via LLM,
   covering 13 languages, including test set, validation set, and training set.
   We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
-  Therefore, comparing baseline with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
   Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
   We believe that this data will be helpful for the open-source community in training document retrieval models.
@@ -234,22 +241,29 @@ The small-batch strategy is simple but effective, which also can used to fine-tu
 - MCLS: A simple method to improve the performance on long text without fine-tuning.
 If you do not have enough resources to fine-tune the model on long text, this method is useful.
-Refer to our [report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf) for more details.
 **The fine-tuning code and datasets will be open-sourced in the near future.**
 ## Acknowledgement
-Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarritiveQA, etc.
-Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [pyserini](https://github.com/castorini/pyserini).
 ## Citation
-If you find this repository useful, please consider giving a star :star: and a citation
 ```
 ```

 license: mit
 ---
 For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
+# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
 In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
 - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
 - Multi-Linguality: It can support more than 100 working languages.
 ## News:
+- 2/6/2024: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
 - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
 ## Specs
 - Model
 | Model Name |  Dimension | Sequence Length | Introduction |
 |:----:|:---:|:---:|:---:|
 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
 | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
 ## FAQ
 **1. Introduction for different retrieval methods**
 - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
 - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
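
As a rough illustration of the retrieval methods listed above, the sketch below scores one query against one document under each scheme. The function names and the toy weights and vectors are assumptions made for this example, not the FlagEmbedding API; the multi-vector score uses ColBERT's sum-of-max-similarity form.

```python
import numpy as np


def dense_score(q_vec, d_vec):
    """Dense retrieval: one vector per text, compared by inner product."""
    return float(np.dot(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Sparse retrieval (lexical matching): only tokens present in both
    texts contribute; each contributes the product of its two weights."""
    shared = q_weights.keys() & d_weights.keys()
    return float(sum(q_weights[t] * d_weights[t] for t in shared))


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval (ColBERT-style late interaction): for every
    query vector take its best-matching document vector, then sum."""
    sim = q_vecs @ d_vecs.T          # shape: (num_query_vecs, num_doc_vecs)
    return float(sim.max(axis=1).sum())


# Toy inputs (hypothetical values, chosen only to make the arithmetic visible).
query_weights = {"what": 0.2, "bm25": 0.9}
doc_weights = {"bm25": 0.5, "ranking": 0.4}
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.5, 0.5]])

print(sparse_score(query_weights, doc_weights))
print(multi_vector_score(query_vecs, doc_vecs))
```

Representing the sparse vector as a token-to-weight dictionary avoids materializing a vocabulary-sized array in which most positions are zero.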
 **2. Comparison with BGE-v1.5 and other monolingual models**
 BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
 Contributions from the community are welcome.
+In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
+**Now you can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb). Thanks @jobergum.**
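
One common way to combine dense and sparse results is a weighted sum of the two scores per document. The sketch below shows that idea only; it is not the Pyserini/Faiss pipeline linked above, and the function names and default weights are assumptions chosen for illustration.

```python
def hybrid_scores(dense, sparse, w_dense=1.0, w_sparse=0.3):
    """Fuse per-document scores from two retrievers by weighted sum.

    `dense` and `sparse` map document ids to scores. A document missing
    from one result list contributes 0.0 for that retriever.
    The default weights are illustrative assumptions, not tuned values.
    """
    doc_ids = dense.keys() | sparse.keys()
    return {
        doc_id: w_dense * dense.get(doc_id, 0.0) + w_sparse * sparse.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def top_k(scores, k=10):
    """Return the k document ids with the highest fused score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy result lists (hypothetical scores).
dense_hits = {"doc_a": 0.8, "doc_b": 0.6}
sparse_hits = {"doc_b": 1.0, "doc_c": 0.5}

fused = hybrid_scores(dense_hits, sparse_hits)
print(top_k(fused, k=3))
```

Dense and sparse scores usually live on different scales, so in practice the weights (or a score normalization step) need to be chosen on a validation set.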
 **4. How to fine-tune the BGE-M3 model?**
 You can follow the common procedure in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
 - Long Document Retrieval
   - MLDR:
   ![avatar](./imgs/long.jpg)
+  Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed with an LLM,
   covering 13 languages, including test set, validation set, and training set.
   We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
+  Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
   Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
   We believe that this data will be helpful for the open-source community in training document retrieval models.
 - MCLS: A simple method to improve the performance on long text without fine-tuning.
 If you do not have enough resources to fine-tune the model on long text, this method is useful.
+Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
 **The fine-tuning code and datasets will be open-sourced in the near future.**
 ## Acknowledgement
+Thanks to the authors of open-sourced datasets, including MIRACL, MKQA, NarrativeQA, etc.
+Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).
 ## Citation
+If you find this repository useful, please consider giving it a star :star: and a citation
 ```
+@misc{bge-m3,
+      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+      year={2024},
+      eprint={2402.03216},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
 ```