|
--- |
|
pipeline_tag: sentence-similarity |
|
language: |
|
- es |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
- LSG |
|
- STS |
|
- Long context |
|
license: apache-2.0 |
|
--- |
|
### LSG Variant of hiiamsid/sentence_similarity_spanish_es |
|
|
|
#### Overview |
|
|
|
This model is an enhanced version of [hiiamsid/sentence_similarity_spanish_es](https://huggingface.co/hiiamsid/sentence_similarity_spanish_es), converted to the Local Sparse Global (LSG) attention mechanism. The LSG adaptation allows the model to handle longer sequences efficiently, making it more versatile and robust across a wider range of natural language processing tasks.
|
|
|
This LSG adaptation enables the model to efficiently process sequences up to 4096 tokens in length. |
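
A quick way to confirm the extended context window is to inspect the checkpoint's configuration. A minimal sketch, assuming the converted config exposes the standard `max_position_embeddings` field and that the tokenizer's `model_max_length` was updated during conversion:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained(
    'prudant/lsg_4096_sentence_similarity_spanish', trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained('prudant/lsg_4096_sentence_similarity_spanish')

# Both values are expected to reflect the 4096-token window
# (assumption: the conversion updated both fields; verify against the repository files).
print(config.max_position_embeddings)
print(tokenizer.model_max_length)
```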
|
|
|
#### About the LSG architecture |
|
|
|
[LSG (Local Sparse Global)](https://github.com/ccdv-ai/convert_checkpoint_to_lsg) attention is designed to overcome the limitations of standard self-attention in Transformer models when processing long sequences. By combining local windowed attention, sparse attention, and a small number of global tokens, LSG reduces the cost of attention from quadratic to roughly linear in sequence length while maintaining, and often enhancing, model performance.
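
To make the complexity difference concrete, here is a small back-of-the-envelope sketch. It is illustrative only: the `window`, `sparse`, and `n_global` values are made-up defaults, not this checkpoint's actual LSG configuration.

```python
# Toy illustration (not the actual LSG implementation): count how many
# query-key pairs are scored under full self-attention versus an
# LSG-style pattern (local window + sparse connections + global tokens).

def full_attention_pairs(n: int) -> int:
    # Every token attends to every token: quadratic in sequence length.
    return n * n

def lsg_attention_pairs(n: int, window: int = 256, sparse: int = 64, n_global: int = 8) -> int:
    # Each token attends to its local window, a fixed number of sparse
    # positions, and the global tokens; global tokens attend to everything.
    # The total grows roughly linearly with n.
    per_token = min(n, window + sparse + n_global)
    return n * per_token + n_global * n

for n in (512, 1024, 4096):
    print(f"n={n}: full={full_attention_pairs(n):,}  lsg~={lsg_attention_pairs(n):,}")
```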
|
|
|
#### Model adaptation |
|
|
|
This LSG variant has been adapted from the original model with the primary goal of extending its capabilities to efficiently handle longer text inputs. This enhancement enables the model to maintain high accuracy and efficiency, even with extended sequence lengths that were previously challenging for the base model. |
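
For reference, conversions like this one are typically produced with the `lsg-converter` package from the repository linked above. The sketch below reflects that package's documented API as an assumption, not the exact command used for this checkpoint:

```python
# Assumed API of the lsg-converter package (pip install lsg-converter);
# class and method names may differ between versions.
from lsg_converter import LSGConverter

converter = LSGConverter(max_sequence_length=4096)

# Convert the original checkpoint to its LSG variant and save it locally.
model, tokenizer = converter.convert_from_pretrained("hiiamsid/sentence_similarity_spanish_es")
model.save_pretrained("lsg_4096_sentence_similarity_spanish")
tokenizer.save_pretrained("lsg_4096_sentence_similarity_spanish")
```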
|
|
|
#### Use cases |
|
|
|
The LSG-enhanced model is particularly adept at tasks involving embeddings for longer documents. |
|
|
|
|
|
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('prudant/lsg_4096_sentence_similarity_spanish')
model = AutoModel.from_pretrained('prudant/lsg_4096_sentence_similarity_spanish', trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences
sentences = [
    'Esa es una persona feliz',
    'Ese es un perro feliz',
    'Esa es una persona muy feliz',
    'Hoy es un día soleado',
    'Esa es una persona alegre',
]

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Normalize embeddings (L2)
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the first sentence and the others
cosine_similarities = F.cosine_similarity(normalized_embeddings[0].unsqueeze(0), normalized_embeddings[1:], dim=1)

print(cosine_similarities)
```
|
|
|
```
Sentence embeddings:
tensor([[-0.1691, -0.2517, -1.3000,  ...,  0.1557,  0.3824,  0.2048],
        [ 0.1872, -0.7604, -0.4863,  ..., -0.4922, -0.1511, -0.8539],
        [-0.2467, -0.2373, -1.1708,  ...,  0.4637,  0.0616,  0.2841],
        [-0.2384,  0.1681, -0.3498,  ..., -0.2744, -0.1722, -1.2513],
        [ 0.2273, -0.2393, -1.6124,  ...,  0.6065,  0.2784, -0.3354]])

tensor([0.5132, 0.9346, 0.3471, 0.8543])
```
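
For longer inputs the pipeline above stays the same; only `max_length` changes. A minimal sketch reusing `tokenizer`, `model`, `mean_pooling`, and the imports from the example above (`long_document` is a placeholder for your own text):

```python
# `long_document` is a placeholder; replace it with a Spanish text longer than 512 tokens.
long_document = "..."

encoded = tokenizer(
    long_document,
    padding=True,
    truncation=True,
    max_length=4096,  # take advantage of the extended LSG context window
    return_tensors='pt',
)

with torch.no_grad():
    output = model(**encoded)

# One L2-normalized embedding for the whole document
doc_embedding = F.normalize(mean_pooling(output, encoded['attention_mask']), p=2, dim=1)
print(doc_embedding.shape)  # (1, hidden_size)
```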
|
|
|
#### Acknowledgments |
|
|
|
This model was adapted by Darío Muñoz Prudant. Thanks to the Hugging Face community and to the contributors of the LSG attention mechanism for their resources and support.
|
|