---
tags:
- mteb
- sentence-transformers
- transformers
- multilingual
- sentence-similarity
license: apache-2.0
---

## gte-multilingual-base

The **gte-multilingual-base** model is the latest in the [GTE](https://huggingface.co/collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469) (General Text Embedding) family of models, featuring several key attributes:

- **High Performance**: Achieves state-of-the-art (SOTA) results in multilingual retrieval tasks and multi-task representation model evaluations when compared with models of similar size.
- **Training Architecture**: Trained with an encoder-only transformer architecture, resulting in a smaller model size. Unlike previous models based on decoder-only LLM architectures (e.g., gte-qwen2-1.5b-instruct), this model has lower hardware requirements for inference and offers a 10x increase in inference speed.
- **Long Context**: Supports text lengths up to **8192** tokens.
- **Multilingual Capability**: Supports over **70** languages.
- **Elastic Dense Embedding**: Supports elastic dense representations (output dimensions from 128 to 768) while maintaining effectiveness on downstream tasks, which significantly reduces storage costs and improves execution efficiency; see the short sketch after this list.
- **Sparse Vectors**: In addition to dense representations, it can also generate sparse vectors.
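
As a rough illustration of the elastic embedding property: a 768-dimensional output vector can be truncated to its leading components and re-normalized before similarity comparison. This is only a sketch; the 256-dimensional size is an arbitrary example and the random tensor stands in for real model output produced as in the Usage section below.

```
import torch
import torch.nn.functional as F

def truncate_embeddings(embeddings: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Keep the first `dim` components and re-normalize so dot products remain cosine similarities."""
    truncated = embeddings[:, :dim]
    return F.normalize(truncated, p=2, dim=1)

# Placeholder standing in for real (batch, 768) model outputs
full = torch.randn(4, 768)
small = truncate_embeddings(full, dim=256)
print(small.shape)  # torch.Size([4, 256]); a float32 vector now takes 1 KiB instead of 3 KiB
```
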
## Model Information

- Model Size: 304M
- Embedding Dimension: 768
- Max Input Tokens: 8192

## Requirements

```
transformers>=4.39.2
flash_attn>=2.5.6
```
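
For example, these can be installed with pip (note that building flash_attn typically requires a CUDA-capable environment):

```
pip install "transformers>=4.39.2" "flash_attn>=2.5.6"
```
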
## Usage

Get Dense Embeddings with Transformers:

```
# Requires transformers>=4.39.2 (see Requirements)
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",        # "Beijing"
    "快排算法介绍"  # "introduction to the quicksort algorithm"
]

model_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)

# The output embedding dimension is elastic: any value in [128, 768]
dimension = 768
# Take the CLS token embedding and keep its first `dimension` components
embeddings = outputs.last_hidden_state[:, 0][:, :dimension]

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
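
For pure inference it may be worth disabling gradient tracking around the forward pass; a small variation of the snippet above, reusing its `model` and `batch_dict`:

```
import torch

# No gradients are needed at inference time, which saves memory
with torch.no_grad():
    outputs = model(**batch_dict)
```
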
Use with sentence-transformers:

```
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(input_texts)

# Score the first text against the rest with the imported cos_sim helper
scores = cos_sim(embeddings[:1], embeddings[1:])
print(scores.tolist())
```
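
To use the elastic embedding dimension through sentence-transformers, recent releases (v2.7 or later, an assumption worth checking against your installed version) accept a `truncate_dim` argument; a minimal sketch:

```
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of each embedding
model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True, truncate_dim=256)
embeddings = model.encode(["what is the capital of China?", "北京"])
print(embeddings.shape)  # (2, 256)
```
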
Use with custom code to get dense embeddings and sparse token weights:

```
# You can find gte_embedding.py at https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py
from gte_embedding import GTEEmbeddidng

model_path = 'Alibaba-NLP/gte-multilingual-base'
model = GTEEmbeddidng(model_path)
query = "中国的首都在哪儿"  # "Where is the capital of China?"

docs = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]

embs = model.encode(docs, return_dense=True, return_sparse=True)
print('dense_embeddings vecs', embs['dense_embeddings'])
print('token_weights', embs['token_weights'])

pairs = [(query, doc) for doc in docs]
dense_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.0)
sparse_scores = model.compute_scores(pairs, dense_weight=0.0, sparse_weight=1.0)
hybrid_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.3)

print('dense_scores', dense_scores)
print('sparse_scores', sparse_scores)
print('hybrid_scores', hybrid_scores)
```