|
--- |
|
base_model: BAAI/bge-m3 |
|
language: |
|
- en |
|
- ru |
|
license: mit |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
--- |
|
|
|
# Model for English and Russian |
|
|
|
This is a truncated version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3). |
|
|
|
Only English and Russian tokens were kept in the vocabulary, making the model about 1.5× smaller than the original while producing the same embeddings for these languages. |
|
|
|
The model has been truncated in [this notebook](https://colab.research.google.com/drive/19IFjWpJpxQie1gtHSvDeoKk7CQtpy6bT?usp=sharing). |
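

As a quick sanity check of the "same embeddings" claim, you can compare the original and truncated models on the same sentence. This is a minimal sketch, assuming the dense BGE-M3 embedding is the L2-normalized `[CLS]` hidden state; it should print a cosine similarity close to 1.0:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def embed(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    mod = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        out = mod(**tok(text, return_tensors="pt"))
    # Dense BGE-M3 embedding: L2-normalized [CLS] hidden state (assumption)
    return F.normalize(out.last_hidden_state[:, 0], dim=-1)

text = "Это пример предложения"
sim = embed("BAAI/bge-m3", text) @ embed("qilowoq/bge-m3-en-ru", text).T
print(sim.item())  # expected to be ~1.0
```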
|
|
|
## Usage |
|
|
|
|
|
### Generate embeddings for text |
|
|
|
```python |
|
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaModel

tokenizer = XLMRobertaTokenizer.from_pretrained('qilowoq/bge-m3-en-ru')
model = XLMRobertaModel.from_pretrained('qilowoq/bge-m3-en-ru')

sentences = ["This is an example sentence", "Это пример предложения"]

# Tokenize both sentences and take the pooled [CLS] representation
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    embeddings = model(**inputs).pooler_output
|
``` |
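

To score how similar the two sentences are, one option is to cosine-normalize the pooled embeddings and take their dot product. A minimal follow-up to the snippet above:

```python
import torch.nn.functional as F

# Pairwise cosine similarity between the sentence embeddings
normed = F.normalize(embeddings, dim=-1)
scores = normed @ normed.T
print(scores)  # off-diagonal entries give the EN-RU similarity
```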
|
|
|
|
|
## Acknowledgement |
|
|
|
Thanks to the authors of open-source datasets such as MIRACL, MKQA, and NarrativeQA. |
|
Thanks to open-source libraries such as [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini). |
|
|
|
|
|
|
|
## Citation |
|
|
|
If you find this repository useful, please consider giving it a star :star: and citing the original work: |
|
|
|
```bibtex |
|
@misc{bge-m3, |
|
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, |
|
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu}, |
|
year={2024}, |
|
eprint={2402.03216}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |