	---
	base_model: BAAI/bge-m3
	language:
	- en
	- ru
	license: mit
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	---

	# Model for English and Russian

	This is a truncated version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3).

	This model keeps only the English and Russian tokens in its vocabulary. This makes it about 1.5 times smaller than the original model while producing the same embeddings.

	The model has been truncated in [this notebook](https://colab.research.google.com/drive/19IFjWpJpxQie1gtHSvDeoKk7CQtpy6bT?usp=sharing).
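
The idea behind vocabulary truncation can be illustrated with a toy example. The sketch below is not the code from the notebook; the vocabulary, the vectors, and the helper `truncate_vocab` are all hypothetical. It keeps only selected rows of an embedding table and remaps the token ids, so every kept token still maps to the same vector, which is why the resulting embeddings do not change.

```python
# Toy illustration of vocabulary truncation (hypothetical data, not the notebook's code).

def truncate_vocab(vocab, embedding_table, keep_tokens):
    """Return a smaller vocab and embedding table containing only keep_tokens.

    vocab: dict mapping token -> row index in embedding_table
    embedding_table: list of vectors, one per token id
    keep_tokens: iterable of tokens to retain
    """
    new_vocab = {}
    new_table = []
    # Walk the kept tokens in their original id order so the result is deterministic.
    for token in sorted(keep_tokens, key=lambda t: vocab[t]):
        new_vocab[token] = len(new_table)  # new, compact id
        new_table.append(embedding_table[vocab[token]])  # same vector as before
    return new_vocab, new_table


# A made-up five-token vocabulary with 2-dimensional vectors.
vocab = {"<s>": 0, "hello": 1, "привет": 2, "你好": 3, "bonjour": 4}
embedding_table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

small_vocab, small_table = truncate_vocab(
    vocab, embedding_table, keep_tokens=["<s>", "hello", "привет"]
)

# Each kept token gets a new id but still points at the same vector.
for token in small_vocab:
    assert small_table[small_vocab[token]] == embedding_table[vocab[token]]

print(len(embedding_table), "->", len(small_table))
```

In a real model, presumably only the token embedding matrix and the tokenizer shrink in this way, while the transformer layers stay untouched.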

	## FAQ


	### Generate embeddings for text

	```python
	import torch
	from transformers import XLMRobertaModel, XLMRobertaTokenizer

	tokenizer = XLMRobertaTokenizer.from_pretrained('qilowoq/bge-m3-en-ru')
	model = XLMRobertaModel.from_pretrained('qilowoq/bge-m3-en-ru')

	sentences = ["This is an example sentence", "Это пример предложения"]

	# Tokenize the batch, then run the model without tracking gradients.
	inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
	with torch.no_grad():
	    embeddings = model(**inputs).pooler_output
	```
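
Since the model is tagged for sentence similarity, a common next step is to compare two embeddings with cosine similarity. The following is a minimal sketch in plain Python, assuming cosine similarity is the metric you want; the helper `cosine_similarity` and the short example vectors are hypothetical stand-ins for real embeddings. Rows of the `embeddings` tensor could be converted to lists with `embeddings.tolist()` before being passed in.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for two sentence embeddings.
vec_en = [0.6, 0.8, 0.0]
vec_ru = [0.8, 0.6, 0.0]

score = cosine_similarity(vec_en, vec_ru)
print(round(score, 2))
```

Scores close to 1 indicate that the two vectors point in nearly the same direction.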


	## Acknowledgement

	Thanks to the authors of open-source datasets, including MIRACL, MKQA, NarrativeQA, etc.
	Thanks also to open-source libraries such as [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).



	## Citation

	If you find this repository useful, please consider giving it a star :star: and citing the paper:

	```
	@misc{bge-m3,
	  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
	  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
	  year={2024},
	  eprint={2402.03216},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL}
	}
	```