
TRoTR-distiluse-base-multilingual-cased-v1

FrancescoPeriti/TRoTR-distiluse-base-multilingual-cased-v1 is a fine-tuned version of sentence-transformers/distiluse-base-multilingual-cased-v1.

NOTE: In our work, we performed 10-fold cross-validation. For a given model (e.g., distiluse-base-multilingual-cased-v1), this involved fine-tuning 10 separate models and reporting the average performance across the test folds. Rather than sharing the fine-tuned model for every fold, we provide only one example model, the one for FOLD1. The results in the paper are averaged across all folds, so the performance of this single model is not directly comparable to them.

You can find more details in our paper TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse by Francesco Periti, Pierluigi Cassotti, Stefano Montanelli, Nina Tahmasebi, and Dominik Schlechtweg. The repository of our project is https://github.com/FrancescoPeriti/TRoTR.

Model Description

This model is designed to evaluate the topic relatedness of text reuse in different contexts.

The model is fine-tuned on the TRoTR dataset for text recontextualization using contrastive learning. Specifically, given a target text-reuse excerpt 𝑡 appearing in two contexts 𝑐₁ and 𝑐₂, the model is trained to minimize the embedding distance between 𝑐₁ and 𝑐₂ if they share the same topic, and to maximize it otherwise.
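
The training objective described above can be illustrated with a small sketch. The card does not state the exact loss used, so this assumes the classic margin-based contrastive loss over cosine distance; the function names, the margin of 0.5, and the toy 2-d embeddings are all hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(emb_c1, emb_c2, same_topic, margin=0.5):
    """Margin-based contrastive loss for one pair of context embeddings.

    Assumption: this is the textbook formulation, not necessarily the
    exact loss used to train the model.
    - same topic:      penalize any distance between the contexts
    - different topic: penalize only distances smaller than `margin`
    """
    d = cosine_distance(emb_c1, emb_c2)
    if same_topic:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy embeddings (hypothetical): an identical pair and an orthogonal pair.
identical = ([1.0, 0.0], [1.0, 0.0])   # cosine distance 0
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # cosine distance 1

print(contrastive_loss(*identical, same_topic=True))    # no penalty
print(contrastive_loss(*identical, same_topic=False))   # penalized: too close
print(contrastive_loss(*orthogonal, same_topic=True))   # penalized: too far
print(contrastive_loss(*orthogonal, same_topic=False))  # no penalty
```

Under this formulation, pairs of contexts with different topics stop contributing to the loss once they are at least `margin` apart, so training effort concentrates on pairs that are still confusable.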

As an example, consider three recontextualizations of the biblical passage John 15:13:

  • (1) It’s the wonderful pride month!! ❤️🧡💛💚💙💜 Honestly pride is everyday! Love is love don’t forget I love you ❤️. Remember this! John 15:12-13: “My command is this: Love each other as I have loved you. Greater love has no one than this: to lay down one’s life for one’s friends
  • (2) At a large Crimean event today Putin quoted the Bible to defend the special military operation in Ukraine which has killed thousands and displaced millions. His words “There is no greater love than if someone gives soul for their friends”. And people were cheering him. Madness!!!
  • (3) “Freeing people from genocide is the reason, motive & goal of the military operation we started in the Donbas& Ukraine”, Putin says, then quotes the Bible: “There is no greater love than to lay down one’s life for one’s friends.” It’s like Billy Graham meets North Korea

In this example, the same biblical passage appears in three texts with different topical contexts. Text (1) differs in topic from texts (2) and (3), whereas texts (2) and (3) are topically related.

How to Get Started with the Model

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('FrancescoPeriti/TRoTR-distiluse-base-multilingual-cased-v1')

# Example sentences for text recontextualization
context1 = "It's the wonderful pride month!! ❤️🧡💛💚💙💜 Honestly pride is everyday! Love is love don't forget I love you ❤️. Remember this! John 15:12-13: My command is this: Love each other as I have loved you. Greater love has no one than this: to lay down one's life for one's friends"
context2 = "At a large Crimean event today Putin quoted the Bible to defend the special military operation in Ukraine which has killed thousands and displaced millions. His words \"Greater love has no one than this: to lay down one's life for one's friends\". And people were cheering him. Madness!!!"
context3 = "\"Freeing people from genocide is the reason, motive and goal of the military operation we started in the Donbas and Ukraine\", Putin says, then quotes the Bible: \"Greater love has no one than this: to lay down one's life for one's friends\" It's like Billy Graham meets North Korea."

# Encode the three contexts into embeddings
embedding1 = model.encode([context1])
embedding2 = model.encode([context2])
embedding3 = model.encode([context3])

# Calculate similarity
similarity1 = model.similarity(embedding1, embedding2)
similarity2 = model.similarity(embedding1, embedding3)
similarity3 = model.similarity(embedding2, embedding3)

# Print the similarity scores
print(f"Cosine similarities between the contexts: {similarity1}, {similarity2}, {similarity3}")
# Cosine similarities between the contexts: tensor([[0.4249]]), tensor([[0.4724]]), tensor([[0.8182]])
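
One possible way to read these scores is to turn them into binary topic-relatedness decisions. The sketch below reuses the three similarity values printed above; the threshold of 0.6 and the helper name `topic_related` are hypothetical and not taken from the paper. In practice a cut-off would need to be tuned on annotated data.

```python
# Pairwise cosine similarities reported in the example output above.
similarities = {
    ("context1", "context2"): 0.4249,
    ("context1", "context3"): 0.4724,
    ("context2", "context3"): 0.8182,
}

# Hypothetical cut-off, chosen only for illustration.
THRESHOLD = 0.6


def topic_related(similarity, threshold=THRESHOLD):
    """Treat a pair of contexts as topic related if its similarity reaches the threshold."""
    return similarity >= threshold


decisions = {pair: topic_related(score) for pair, score in similarities.items()}

for (first, second), related in decisions.items():
    label = "topic related" if related else "not topic related"
    print(f"{first} vs {second}: {label}")
```

With this illustrative threshold, only the pair (context2, context3) is labeled topic related, which matches the description of the example texts given earlier.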

Citation

Francesco Periti, Pierluigi Cassotti, Stefano Montanelli, Nina Tahmasebi, and Dominik Schlechtweg. 2024. TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13972–13990, Miami, Florida, USA. Association for Computational Linguistics.

BibTeX:

@inproceedings{periti2024trotr,
    title = {{TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse}},
    author = "Periti, Francesco  and Cassotti, Pierluigi  and Montanelli, Stefano  and Tahmasebi, Nina  and Schlechtweg, Dominik",
    editor = "Al-Onaizan, Yaser  and Bansal, Mohit  and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.774",
    pages = "13972--13990",
    abstract = "Current approaches for detecting text reuse do not focus on recontextualization, i.e., how the new context(s) of a reused text differs from its original context(s). In this paper, we propose a novel framework called TRoTR that relies on the notion of topic relatedness for evaluating the diachronic change of context in which text is reused. TRoTR includes two NLP tasks: TRiC and TRaC. TRiC is designed to evaluate the topic relatedness between a pair of recontextualizations. TRaC is designed to evaluate the overall topic variation within a set of recontextualizations. We also provide a curated TRoTR benchmark of biblical text reuse, human-annotated with topic relatedness. The benchmark exhibits an inter-annotator agreement of .811. We evaluate multiple, established SBERT models on the TRoTR tasks and find that they exhibit greater sensitivity to textual similarity than topic relatedness. Our experiments show that fine-tuning these models can mitigate such a kind of sensitivity.",
}

Model size: 135M params (Safetensors, tensor type F32)