metadata
license: apache-2.0
language:
- it
Model: DistilUSE
Lang: IT
Model description
This is a Universal Sentence Encoder [1] model for the Italian language. It was obtained by taking mDistilUSE (distiluse-base-multilingual-cased-v1) as a starting point and specializing it for Italian by modifying the embedding layer (as in [2], computing document-level frequencies over the Wikipedia dataset).
The resulting model has 67M parameters, a vocabulary of 30,785 tokens, and a size of ~270 MB.
It can be used to encode Italian texts and compute similarities between them.
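The embedding-layer reduction described above can be illustrated with a minimal sketch. It is only an approximation of the idea in [2], not the exact procedure used for this model: the function names `select_vocabulary` and `prune_embeddings`, the `min_doc_freq` threshold, and the whitespace tokenizer in the demo are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def select_vocabulary(documents, tokenize, min_doc_freq=2, special_tokens=()):
    """Keep tokens that occur in at least `min_doc_freq` documents (hypothetical helper)."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document-level frequency)
        doc_freq.update(set(tokenize(doc)))
    kept = [tok for tok, df in doc_freq.items() if df >= min_doc_freq]
    # Most frequent first, ties broken alphabetically for determinism
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    # Special tokens are always kept, regardless of frequency
    return list(special_tokens) + [t for t in kept if t not in special_tokens]


def prune_embeddings(embeddings, old_vocab, kept_tokens):
    """Keep only the embedding rows of the retained tokens (hypothetical helper)."""
    rows = [old_vocab[tok] for tok in kept_tokens]
    return embeddings[rows]


# Toy demo with a whitespace tokenizer (an assumption, not the real tokenizer)
docs = ["il gatto dorme", "il cane dorme", "the cat sleeps"]
kept = select_vocabulary(docs, str.split, min_doc_freq=2, special_tokens=("[CLS]",))
old_vocab = {"[CLS]": 0, "il": 1, "gatto": 2, "dorme": 3, "the": 4}
old_embeddings = np.arange(10, dtype=np.float32).reshape(5, 2)
new_embeddings = prune_embeddings(old_embeddings, old_vocab, kept)
print(kept, new_embeddings.shape)
```

Dropping the rows of tokens that rarely appear in the target language shrinks the embedding matrix while leaving the remaining layers untouched.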
Quick usage
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
# Path to the model files; adjust it to your setup
model_path = "../osiria/distiluse-base-italian/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
def embed(text):
    # Use the hidden state of the first token as the sentence embedding
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).last_hidden_state[0, 0, :].cpu().numpy()
text1 = "Alessandro Manzoni è stato uno scrittore italiano"
text2 = "Giacomo Leopardi è stato un poeta italiano"
vec1 = embed(text1)
vec2 = embed(text2)
# Cosine similarity between the two sentence embeddings
cosine_similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print("COSINE SIMILARITY:", cosine_similarity)
# COSINE SIMILARITY: 0.734292
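When more than two texts need to be compared, the same cosine similarity can be computed for all pairs at once. The following is a minimal sketch using NumPy only; the helper name `cosine_similarity_matrix` is hypothetical, and the toy vectors stand in for real sentence embeddings.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors
    normalized = vectors / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Toy vectors in place of real embeddings
toy = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
sim = cosine_similarity_matrix(toy)
print(sim)
```

With the embeddings from the quick usage example, `cosine_similarity_matrix([vec1, vec2])[0, 1]` should agree with the scalar value computed there, up to floating-point precision.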
References
[1] https://arxiv.org/abs/1907.04307
[2] https://arxiv.org/abs/2010.05609
License
The model is released under the Apache-2.0 license.