|
--- |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
- argument-mining |
|
- Twitter |
|
metrics: |
|
- macro-F1 |
|
license: cc-by-nc-sa-4.0
|
language: |
|
- en |
|
widget: |
|
- source_sentence: >- |
|
The formula: Not everyone who voted Leave is racist. But everyone who's |
|
racist voted Leave. Not everyone who voted Leave is thick. But everyone |
|
who's thick voted Leave. The thick racists therefore called the shots, |
|
whatever the thoughts of the minority of others. #thick #Brexit |
|
sentences: |
|
- 'Men shouldn’t be making laws about women’s bodies #abortion #Texas' |
|
- >- |
|
Opinion: As the draconian (and then some) abortion law takes effecting |
|
#Texas, this is not an idle question for millions of Americans. A slippery |
|
slope towards more like-minded Republican state-legislatures to try to |
|
follow suit. #abortion #F24 HTTPURL |
|
- >- |
|
’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU |
|
citizenship after #Brexit HTTPURL via @USER |
|
- '@USER Blah blah blah blah blah blah' |
|
example_title: Reason |
|
- source_sentence: This is NOT good for children. |
|
sentences: |
|
- 'Men shouldn’t be making laws about women’s bodies #abortion #Texas' |
|
- >- |
|
Opinion: As the draconian (and then some) abortion law takes effecting |
|
#Texas, this is not an idle question for millions of Americans. A slippery |
|
slope towards more like-minded Republican state-legislatures to try to |
|
follow suit. #abortion #F24 HTTPURL |
|
- >- |
|
’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU |
|
citizenship after #Brexit HTTPURL via @USER |
|
- '@USER Blah blah blah blah blah blah' |
|
example_title: Statement |
|
- source_sentence: >- |
|
Elon Musk ready with 'Plan B' if Twitter rejects his offer Read @USER Story |
|
| HTTPURL #ElonMusk #ElonMuskTwitter #TwitterTakeover HTTPURL |
|
sentences: |
|
- 'Men shouldn’t be making laws about women’s bodies #abortion #Texas' |
|
- >- |
|
Opinion: As the draconian (and then some) abortion law takes effecting |
|
#Texas, this is not an idle question for millions of Americans. A slippery |
|
slope towards more like-minded Republican state-legislatures to try to |
|
follow suit. #abortion #F24 HTTPURL |
|
- >- |
|
’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU |
|
citizenship after #Brexit HTTPURL via @USER |
|
- '@USER Blah blah blah blah blah blah' |
|
example_title: Notification |
|
- source_sentence: '@USER 👅is the Key 😂' |
|
sentences: |
|
- 'Men shouldn’t be making laws about women’s bodies #abortion #Texas' |
|
- >- |
|
Opinion: As the draconian (and then some) abortion law takes effecting |
|
#Texas, this is not an idle question for millions of Americans. A slippery |
|
slope towards more like-minded Republican state-legislatures to try to |
|
follow suit. #abortion #F24 HTTPURL |
|
- >- |
|
’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU |
|
citizenship after #Brexit HTTPURL via @USER |
|
- '@USER Blah blah blah blah blah blah' |
|
example_title: None |
|
base_model: |
|
- vinai/bertweet-base |
|
library_name: sentence-transformers |
|
--- |
|
|
|
# WRAPresentations -- A TACO-based Embedder For Inference and Information-Driven Argument Mining on Twitter |
|
|
|
Introducing WRAPresentations, a cutting-edge [sentence-transformers](https://www.SBERT.net) model that leverages a 768-dimensional dense
vector space to map tweets according to the four classes Reason, Statement, Notification, and None. The model is tailored for
argument mining on Twitter and derived from the [BERTweet-base](https://huggingface.co/vinai/bertweet-base) architecture, which was initially pre-trained on
Twitter data. Through fine-tuning on the [TACO](https://doi.org/10.5281/zenodo.8030026) dataset, WRAPresentations is effective in encoding
inference and information in tweets.
|
|
|
## Class Semantics |
|
The TACO framework revolves around the two key elements of an argument, as defined by the [Cambridge Dictionary](https://dictionary.cambridge.org). |
|
It encodes *inference* as *a guess that you make or an opinion that you form based on the information that you have*, and it also leverages the |
|
definition of *information* as *facts or details about a person, company, product, etc.*. |
|
|
|
WRAPresentations, to some degree, captures the semantics of these critical components in its embedding space. |
|
|
|
Consequently, it has also learned the class semantics, where each of the following classes aggregates *inference* and *information* in a distinct way:
|
|
|
* *Statement*, which refers to unique cases where only the *inference* is presented as *something that someone says or writes officially, or an action |
|
done to express an opinion*. |
|
* *Reason*, which represents a full argument where the *inference* is based on direct *information* mentioned in the tweet, such as a source-reference |
|
or quotation, and thus reveals the author’s motivation *to try to understand and to make judgments based on practical facts*. |
|
* *Notification*, which refers to a tweet that limits itself to providing *information*, such as media channels promoting their latest articles. |
|
* *None*, which refers to a tweet that provides neither *inference* nor *information*.
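
Viewed through these definitions, the four classes reduce to two binary dimensions:

| Class        | Inference | Information |
|--------------|-----------|-------------|
| Reason       | ✓         | ✓           |
| Statement    | ✓         | ✗           |
| Notification | ✗         | ✓           |
| None         | ✗         | ✗           |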
|
|
|
In its entirety, WRAPresentations encodes the following hierarchy for tweets: |
|
|
|
<div align="center"> |
|
<img src="https://github.com/TomatenMarc/public-images/raw/main/Argument_Tree.svg" alt="Component Space" width="100%"> |
|
</div> |
|
|
|
## Class Semantic Transfer to Embeddings |
|
|
|
Observing the distribution of tweet `CLS` tokens used for later classification within the embedding space of WRAPresentations, we noted that
pre-classification fine-tuning via contrastive learning led to denser and more clearly separated class sectors compared to the embeddings of BERTweet,
as shown in the following figure.
|
<div align="center"> |
|
<img src="https://github.com/TomatenMarc/public-images/raw/main/sector_purity_coordinates.svg" alt="Argument Tree" width="100%"> |
|
</div> |
|
|
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed: |
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can use the model to generate tweet representations like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
tweets = ["This is an example #tweet", "Each tweet is converted"] |
|
|
|
model = SentenceTransformer("TomatenMarc/WRAPresentations") |
|
embeddings = model.encode(tweets) |
|
print(embeddings) |
|
``` |
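
Since the model targets sentence similarity, a natural next step is to compare the resulting embeddings. Here is a minimal sketch using the `util.cos_sim` helper of sentence-transformers, with tweets taken from the widget examples above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("TomatenMarc/WRAPresentations")

# Example tweets from the widget above (already preprocessed)
source = "This is NOT good for children."
candidates = [
    "Men shouldn't be making laws about women's bodies #abortion #Texas",
    "@USER Blah blah blah blah blah blah",
]

# Cosine similarity between the source tweet and each candidate
source_embedding = model.encode(source)
candidate_embeddings = model.encode(candidates)
print(util.cos_sim(source_embedding, candidate_embeddings))
```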
|
|
|
<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;">
Notice: The tweets need to undergo preprocessing following the specifications for BERTweet-base, e.g. using the official <a href="https://github.com/VinAIResearch/BERTweet/blob/master/TweetNormalizer.py">TweetNormalizer</a>.
</blockquote>
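
For illustration, a minimal approximation of that preprocessing, masking user mentions and URLs the way the linked TweetNormalizer does (the full script additionally handles tokenization and emoji):

```python
import re


def normalize_tweet(tweet: str) -> str:
    """Rough approximation of BERTweet's TweetNormalizer: mask user
    mentions and URLs the way the training data was preprocessed."""
    tweet = re.sub(r"@\w+", "@USER", tweet)
    tweet = re.sub(r"(?:https?://|www\.)\S+", "HTTPURL", tweet)
    return " ".join(tweet.split())


print(normalize_tweet("Check this out via @elonmusk https://t.co/xyz"))
# Check this out via @USER HTTPURL
```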
|
|
|
## Usage (HuggingFace Transformers) |
|
|
|
Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first, pass your input through the transformer model,
then apply the right pooling operation on top of the contextualized word embeddings.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
import torch |
|
|
|
|
|
# Mean Pooling - Take attention mask into account for correct averaging |
|
def mean_pooling(model_output, attention_mask): |
|
token_embeddings = model_output[0] # First element of model_output contains all token embeddings |
|
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() |
|
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) |
|
|
|
|
|
# Tweets we want embeddings for |
|
tweets = ["This is an example #tweet", "Each tweet is converted"] |
|
|
|
# Load model from HuggingFace Hub |
|
tokenizer = AutoTokenizer.from_pretrained("TomatenMarc/WRAPresentations") |
|
model = AutoModel.from_pretrained("TomatenMarc/WRAPresentations") |
|
|
|
# Tokenize sentences |
|
encoded_input = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt") |
|
|
|
# Compute token embeddings |
|
with torch.no_grad(): |
|
model_output = model(**encoded_input) |
|
|
|
# Perform pooling. In this case, mean pooling. |
|
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"]) |
|
|
|
print("Sentence embeddings:") |
|
print(sentence_embeddings) |
|
``` |
|
|
|
Furthermore, the WRAPresentations model is a highly suitable embedding component for `AutoModelForSequenceClassification`, enabling
further fine-tuning on tweet classification tasks for the four classes: Reason, Statement, Notification, and None.
The categorization of Reason and Statement as argument classes, and of Notification and None as non-argument classes, is implicitly learned during
the fine-tuning process. This setup facilitates efficient identification and analysis of argumentative and non-argumentative content in tweets, as sketched below.
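
A minimal loading sketch for such a setup; note that the label ordering used here is hypothetical and only for illustration:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical label mapping for illustration; the ordering is not prescribed
id2label = {0: "Reason", 1: "Statement", 2: "Notification", 3: "None"}

tokenizer = AutoTokenizer.from_pretrained("TomatenMarc/WRAPresentations")
model = AutoModelForSequenceClassification.from_pretrained(
    "TomatenMarc/WRAPresentations",
    num_labels=4,
    id2label=id2label,
    label2id={label: i for i, label in id2label.items()},
)
# The classification head is freshly initialized and must be fine-tuned
# on labeled tweets before its predictions become meaningful.
```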
|
|
|
## Training |
|
|
|
The WRAPresentations model underwent fine-tuning with 1,219 golden tweets from the [TACO](https://doi.org/10.5281/zenodo.8030026) dataset, covering six topics.
Five topics were chosen for optimization, comprising 925 tweets (75.88%) covering #brexit (33.3%), #got (17%), #lotrrop (18.8%), #squidgame (17.1%),
and #twittertakeover (13.8%). The model used a stratified 60/40 train/test split on the optimization data.
Additionally, the 294 golden tweets (24.12%) on the topic of #abortion were held out for final evaluation.
|
|
|
Before fine-tuning, we built a copy of the dataset by creating an augmentation of each tweet. The augmentation consisted of replacing all the
topic words and entities in a tweet and then randomly masking 10% of its words, which were then filled in using
[BERTweet-base](https://huggingface.co/vinai/bertweet-base) as a `fill-mask` model. We chose to mask 10% of the words because this resulted in the
smallest average cosine distance (~0.08) between the tweets and their augmentations, making the augmentation itself a regularizing factor during
pre-classification fine-tuning that guards against overfitting to the later test data.
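
A rough sketch of this masking step, assuming a standard `fill-mask` pipeline (the preceding topic-word and entity replacement is not reproduced here):

```python
import random

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="vinai/bertweet-base")


def augment(tweet: str, mask_ratio: float = 0.10, seed: int = 42) -> str:
    """Mask ~10% of the words, one position at a time, and let
    BERTweet-base propose a replacement for each masked position."""
    rng = random.Random(seed)
    words = tweet.split()
    n_mask = max(1, round(mask_ratio * len(words)))
    for idx in rng.sample(range(len(words)), n_mask):
        words[idx] = fill_mask.tokenizer.mask_token
        words[idx] = fill_mask(" ".join(words))[0]["token_str"].strip()
    return " ".join(words)
```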
|
During fine-tuning, we formed pairs by matching each tweet with all remaining tweets in the same data split (training, testing, holdout),
labeled as similar or dissimilar according to their class labels. For the training and testing sets, we utilized the augmentations, and for the
holdout tweets, we used their original text to test the fine-tuning process and how well the augmentations transfer to real tweets.
For all pairs, we chose the largest possible set in which similar and dissimilar pairs are equally represented while all tweets
of the respective data split are covered, as sketched below.
This process created 162,064 pairs for training and 71,812 pairs for testing. An additional 53,560 pairs were used for final evaluation with the
holdout data. Moreover, we utilized `MEAN` pooling during fine-tuning to enhance the sentence representations.
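
For illustration, a simplified sketch of such balanced pair construction (the exact sampling additionally guarantees that every tweet of the split is covered, which this sketch does not):

```python
import random
from itertools import combinations

from sentence_transformers import InputExample


def build_balanced_pairs(tweets, labels, seed=42):
    """Pair each tweet with all others in the split, labeling 1.0 for
    same-class (similar) and 0.0 for different-class (dissimilar) pairs,
    then downsample so both kinds are equally represented."""
    similar, dissimilar = [], []
    for i, j in combinations(range(len(tweets)), 2):
        same = labels[i] == labels[j]
        pair = InputExample(texts=[tweets[i], tweets[j]], label=float(same))
        (similar if same else dissimilar).append(pair)
    rng = random.Random(seed)
    n = min(len(similar), len(dissimilar))
    return rng.sample(similar, n) + rng.sample(dissimilar, n)
```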
|
|
|
The model was trained with the parameters: |
|
|
|
**DataLoader**: |
|
|
|
`torch.utils.data.dataloader.DataLoader` of length 5065 with parameters: |
|
|
|
``` |
|
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'} |
|
``` |
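
(A DataLoader length of 5,065 is consistent with the 162,064 training pairs at batch size 32, the final batch being partial: 162,064 / 32 = 5,064.5, rounded up to 5,065 batches.)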
|
|
|
**Loss**: |
|
|
|
`sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters: |
|
|
|
``` |
|
{'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True} |
|
``` |
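
For reference, this corresponds to the contrastive loss of Hadsell et al. (2006): for a pair with label $y \in \{0, 1\}$ ($y = 1$ for similar pairs) and cosine distance $d$ between the two embeddings,

$$
\mathcal{L} = \frac{1}{2}\Big[\, y \cdot d^2 + (1 - y) \cdot \max(0,\, m - d)^2 \Big], \qquad m = 0.5,
$$

so similar pairs are pulled together while dissimilar pairs are pushed apart until their distance exceeds the margin $m$.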
|
|
|
Parameters of the `fit()` method:
|
|
|
``` |
|
{ |
|
"epochs": 5, |
|
"evaluation_steps": 1000, |
|
"evaluator": "sentence_transformers.evaluation.BinaryClassificationEvaluator.BinaryClassificationEvaluator", |
|
"max_grad_norm": 1, |
|
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>", |
|
"optimizer_params": { |
|
"lr": 4e-05 |
|
}, |
|
"scheduler": "WarmupLinear", |
|
"steps_per_epoch": null, |
|
"warmup_steps": 2533, |
|
"weight_decay": 0.01 |
|
} |
|
``` |
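
The 2,533 warmup steps correspond to roughly 10% of the total training steps (5 epochs × 5,065 batches = 25,325 steps).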
|
|
|
## Evaluation Results |
|
|
|
We optimized several BERTweet models with `CLS` or `MEAN` pooling and evaluated them using SBERT's `BinaryClassificationEvaluator` with
standard `CLS` tokens for classification, showing:
|
|
|
|
|
| Model | Precision | Recall | F1 | Support | |
|
|-----------------------------------------|-----------|---------|--------|---------| |
|
| Vanilla BERTweet-`CLS` | 50.00% | 100.00% | 66.67% | 53,560 | |
|
| Augmented BERTweet-`CLS` | 65.69% | 86.66% | 74.73% | 53,560 | |
|
| WRAPresentations-`CLS` | 66.00% | 84.32% | 74.04% | 53,560 | |
|
| WRAPresentations-`MEAN` (current model) | 63.05% | 88.91% | 73.78% | 53,560 | |
|
|
|
The outcomes for WRAPresentations-`MEAN` are influenced by the use of `CLS` pooling during testing, while `MEAN` pooling was employed during
fine-tuning. Despite this, employing `MEAN` pooling during the fine-tuning process still improved the `CLS` representation, particularly in terms
of recall. When WRAPresentations-`MEAN` is tested with `MEAN` pooling instead, the resulting F1 score is 74.07%.
|
|
|
## Full Model Architecture |
|
<div align="center"> |
|
<img src="https://github.com/TomatenMarc/public-images/raw/main/contrastive_siamese_network.svg" alt="Argument Tree" width="100%"> |
|
</div> |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel |
|
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}) |
|
) |
|
``` |
|
|
|
## Environmental Impact
|
|
|
- **Hardware Type:** A100 PCIe 40GB |
|
- **Hours used:** 2h |
|
- **Cloud Provider:** [Google Cloud Platform](https://colab.research.google.com) |
|
- **Compute Region:** [asia-southeast1](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones?hl=en) (Singapore) |
|
- **Carbon Emitted:** 0.21 kg CO2
|
|
|
## Licensing
|
|
|
[WRAPresentations](https://huggingface.co/TomatenMarc/WRAPresentations) © 2023 is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1). |
|
|
|
## Citation
|
|
|
``` |
|
@inproceedings{feger-dietze-2024-bertweets, |
|
title = "{BERT}weet{'}s {TACO} Fiesta: Contrasting Flavors On The Path Of Inference And Information-Driven Argument Mining On {T}witter", |
|
author = "Feger, Marc and |
|
Dietze, Stefan", |
|
editor = "Duh, Kevin and |
|
Gomez, Helena and |
|
Bethard, Steven", |
|
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024", |
|
month = jun, |
|
year = "2024", |
|
address = "Mexico City, Mexico", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2024.findings-naacl.146", |
|
doi = "10.18653/v1/2024.findings-naacl.146", |
|
pages = "2256--2266" |
|
} |
|
``` |