Twitter4SSE

This model maps texts to 768-dimensional dense embeddings that encode semantic similarity. It was initialized from BERTweet and trained with the sentence-transformers library, using Multiple Negatives Ranking Loss (MNRL) on a Twitter dataset.
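For readers unfamiliar with MNRL, the idea can be sketched in a few lines of plain Python: within a batch of (anchor, positive) pairs, each anchor should score highest against its own positive, and every other positive in the batch serves as a negative. This is only an illustrative sketch, not the training code used for this model. The function names are hypothetical, and `scale=20.0` is assumed here because it is the default in sentence-transformers' `MultipleNegativesRankingLoss`, not because the authors are known to have used it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and all other positives in the batch act as negatives.
    `scale` is an assumed value (the sentence-transformers default)."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
        # Negative log-probability assigned to the true pair (index i).
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional "embeddings": correctly paired vs. swapped positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is close to zero when each anchor is nearest to its own positive and grows large when the pairs are mismatched, which is what pushes related texts together in the embedding space.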

Usage

The model is easiest to use with the sentence-transformers library:

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
sentences = ["This is the first tweet", "This is the second tweet"]

model = SentenceTransformer('digio/Twitter4SSE')
embeddings = model.encode(sentences)
print(embeddings)
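Since the embeddings encode semantic similarity, a common next step is to compare them with cosine similarity. The following is a minimal sketch in plain Python. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for rows of `embeddings` so that the example runs without downloading the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors in place of real model output; with the model loaded you
# would pass rows of `embeddings`, e.g. embeddings[0] as the query.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With real embeddings, `sentence_transformers.util.cos_sim` computes the same quantity on whole batches at once and is usually the more convenient choice.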

Without the sentence-transformers library, please refer to this repository for detailed instructions on how to use Sentence Transformers models on Hugging Face.

Citing & Authors

The official paper Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings will be presented at EMNLP 2021. Further details will be available soon.

@inproceedings{di-giovanni-brambilla-2021-exploiting,
    title = "Exploiting {T}witter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings",
    author = "Di Giovanni, Marco  and
      Brambilla, Marco",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.780",
    pages = "9902--9910",
}

The official code is available on GitHub.