Twitter4SSE
This model maps texts to 768-dimensional dense embeddings that encode semantic similarity. It was initialized from BERTweet and trained with the sentence-transformers library on a Twitter dataset, using Multiple Negatives Ranking Loss (MNRL).
Usage
The model is easiest to use with the sentence-transformers library:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
sentences = ["This is the first tweet", "This is the second tweet"]
model = SentenceTransformer('digio/Twitter4SSE')
embeddings = model.encode(sentences)
print(embeddings)
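Since the embeddings encode semantic similarity, a common next step is to compare two of them with cosine similarity. The following is a minimal sketch rather than part of the official usage: the helper names `cosine_similarity` and `compare_tweets` are hypothetical, and the choice of cosine similarity as the comparison metric is an assumption, not something this model card states.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_tweets(first: str, second: str) -> float:
    """Hypothetical helper: embed two tweets and return their similarity."""
    # Imported lazily so cosine_similarity works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('digio/Twitter4SSE')
    # encode() returns one embedding per input sentence.
    emb_first, emb_second = model.encode([first, second])
    return cosine_similarity(emb_first, emb_second)
```

Calling `compare_tweets("This is the first tweet", "This is the second tweet")` downloads the model on first use and returns a single score in the range -1 to 1, where higher values indicate more similar texts.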
To use the model without the sentence-transformers library, please refer to this repository for detailed instructions on how to use Sentence Transformers models on Hugging Face.
Citing & Authors
The official paper, Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings, was presented at EMNLP 2021.
@inproceedings{di-giovanni-brambilla-2021-exploiting,
title = "Exploiting {T}witter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings",
author = "Di Giovanni, Marco and
Brambilla, Marco",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.780",
pages = "9902--9910",
}
The official code is available on GitHub.