---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- argument-mining
- Twitter
metrics:
- macro-F1
license: cc-by-sa-4.0
language:
- en
widget:
- source_sentence: >-
    The formula: Not everyone who voted Leave is racist. But everyone who's
    racist voted Leave. Not everyone who voted Leave is thick. But everyone
    who's thick voted Leave. The thick racists therefore called the shots,
    whatever the thoughts of the minority of others. #thick #Brexit
  sentences:
  - 'Men shouldn’t be making laws about women’s bodies #abortion #Texas'
  - >-
    Opinion: As the draconian (and then some) abortion law takes effecting
    #Texas, this is not an idle question for millions of Americans. A slippery
    slope towards more like-minded Republican state-legislatures to try to
    follow suit. #abortion #F24 HTTPURL
  - >-
    ’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU
    citizenship after #Brexit HTTPURL via @USER
  - '@USER Blah blah blah blah blah blah'
  example_title: Reason
- source_sentence: This is NOT good for children.
  sentences:
  - 'Men shouldn’t be making laws about women’s bodies #abortion #Texas'
  - >-
    Opinion: As the draconian (and then some) abortion law takes effecting
    #Texas, this is not an idle question for millions of Americans. A slippery
    slope towards more like-minded Republican state-legislatures to try to
    follow suit. #abortion #F24 HTTPURL
  - >-
    ’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU
    citizenship after #Brexit HTTPURL via @USER
  - '@USER Blah blah blah blah blah blah'
  example_title: Statement
- source_sentence: >-
    Elon Musk ready with 'Plan B' if Twitter rejects his offer  Read @USER Story
    | HTTPURL #ElonMusk #ElonMuskTwitter #TwitterTakeover HTTPURL
  sentences:
  - 'Men shouldn’t be making laws about women’s bodies #abortion #Texas'
  - >-
    Opinion: As the draconian (and then some) abortion law takes effecting
    #Texas, this is not an idle question for millions of Americans. A slippery
    slope towards more like-minded Republican state-legislatures to try to
    follow suit. #abortion #F24 HTTPURL
  - >-
    ’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU
    citizenship after #Brexit HTTPURL via @USER
  - '@USER Blah blah blah blah blah blah'
  example_title: Notification
- source_sentence: '@USER 👅is the Key 😂'
  sentences:
  - 'Men shouldn’t be making laws about women’s bodies #abortion #Texas'
  - >-
    Opinion: As the draconian (and then some) abortion law takes effecting
    #Texas, this is not an idle question for millions of Americans. A slippery
    slope towards more like-minded Republican state-legislatures to try to
    follow suit. #abortion #F24 HTTPURL
  - >-
    ’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU
    citizenship after #Brexit HTTPURL via @USER
  - '@USER Blah blah blah blah blah blah'
  example_title: None
base_model:
- vinai/bertweet-base
library_name: sentence-transformers
---

# WRAPresentations -- A TACO-based Embedder For Inference and Information-Driven Argument Mining on Twitter

WRAPresentations is a [sentence-transformers](https://www.SBERT.net) model that maps tweets into a 768-dimensional dense
vector space organized around the four classes Reason, Statement, Notification, and None. The model is tailored for
argument mining on Twitter and is derived from [BERTweet-base](https://huggingface.co/vinai/bertweet-base), which was pre-trained on
Twitter data. Through fine-tuning with the [TACO](https://doi.org/10.5281/zenodo.8030026) dataset, WRAPresentations learns to encode
inference and information in tweets.

## Class Semantics
The TACO framework revolves around the two key elements of an argument, as defined by the [Cambridge Dictionary](https://dictionary.cambridge.org).
It encodes *inference* as *a guess that you make or an opinion that you form based on the information that you have*, and it also leverages the 
definition of *information* as *facts or details about a person, company, product, etc.*.

WRAPresentations, to some degree, captures the semantics of these critical components in its embedding space.

Consequently, it has also learned the class semantics, in which tweets are grouped into four distinct classes according to
which of these components they contain:

* *Statement*, which refers to unique cases where only the *inference* is presented as *something that someone says or writes officially, or an action
done to express an opinion*.
* *Reason*, which represents a full argument where the *inference* is based on direct *information* mentioned in the tweet, such as a source-reference
or quotation, and thus reveals the author’s motivation *to try to understand and to make judgments based on practical facts*.
* *Notification*, which refers to a tweet that limits itself to providing *information*, such as media channels promoting their latest articles.
* *None*, a tweet that provides neither *inference* nor *information*.

In its entirety, WRAPresentations encodes the following hierarchy for tweets:

<div align="center">
  <img src="https://github.com/TomatenMarc/public-images/raw/main/Argument_Tree.svg" alt="Argument Tree" width="100%">
</div>

## Class Semantic Transfer to Embeddings

Observing the tweet distribution given `CLS` tokens for later classification within the embedding space of WRAPresentations, we noted that 
pre-classification fine-tuning via contrastive learning led to denser emergence of the expected class sectors compared to the embeddings of BERTweet, 
as shown in the following figure.
<div align="center">
  <img src="https://github.com/TomatenMarc/public-images/raw/main/sector_purity_coordinates.svg" alt="Component Space" width="100%">
</div>


## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model to generate tweet representations like this:

```python
from sentence_transformers import SentenceTransformer

tweets = ["This is an example #tweet", "Each tweet is converted"]

model = SentenceTransformer("TomatenMarc/WRAPresentations")
embeddings = model.encode(tweets)
print(embeddings)
```

<a href="https://github.com/VinAIResearch/BERTweet/blob/master/TweetNormalizer.py">
    <blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;">
        Notice: The tweets need to undergo preprocessing following the specifications for BERTweet-base.
    </blockquote>
</a>
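
As a rough illustration of the preprocessing mentioned in the notice, the sketch below masks user mentions and links with the `@USER` and `HTTPURL` placeholders that also appear in the widget examples of this card. It is a simplified stand-in, not the linked `TweetNormalizer.py`: the helper name `normalize_tweet_sketch` and both regular expressions are assumptions made for illustration, and the linked script applies further rules that are omitted here.

```python
import re

# Hypothetical helper: a simplified stand-in for BERTweet-style normalization.
# Assumption: only user mentions and URLs are masked; other rules are omitted.
_URL = re.compile(r"(?:https?://|www\.)\S+")
_MENTION = re.compile(r"@\w+")


def normalize_tweet_sketch(tweet: str) -> str:
    """Replace links with HTTPURL and user mentions with @USER."""
    tweet = _URL.sub("HTTPURL", tweet)
    tweet = _MENTION.sub("@USER", tweet)
    # Collapse repeated whitespace left over from the substitutions.
    return " ".join(tweet.split())


print(normalize_tweet_sketch("@bbcnews Read more at https://example.com/story #Brexit"))
```

For real use, prefer the normalization script linked in the notice so that the input matches what the model saw during training.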

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model,
then you have to apply the right pooling-operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Tweets we want embeddings for
tweets = ["This is an example #tweet", "Each tweet is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("TomatenMarc/WRAPresentations")
model = AutoModel.from_pretrained("TomatenMarc/WRAPresentations")

# Tokenize sentences
encoded_input = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
```

Furthermore, WRAPresentations can serve as the embedding component for `AutoModelForSequenceClassification`, enabling
further fine-tuning on tweet classification into the four classes Reason, Statement, Notification, and None.
The grouping of Reason and Statement as argument classes and of Notification and None as non-argument classes is learned implicitly during
fine-tuning. This setup supports identifying and analyzing argumentative and non-argumentative content in tweets.

## Training

The WRAPresentations model underwent fine-tuning with 1,219 golden tweets from the [TACO](https://doi.org/10.5281/zenodo.8030026) dataset, covering six topics.
Five topics were chosen for optimization, representing 925 tweets (75.88%) covering #brexit (33.3%), #got (17%), #lotrrop (18.8%), #squidgame (17.1%),
and #twittertakeover (13.8%). The model used a stratified 60/40 split for training/testing on optimization data.
Additionally, 294 golden tweets (24.12%) related to the topic of #abortion were chosen as the holdout-set for final evaluation.

Before fine-tuning, we built a copy of the dataset by creating an augmentation of each tweet. The augmentation consisted of replacing all
topic words and entities in a tweet and then randomly masking 10% of the words, which were then filled in using
[BERTweet-base](https://huggingface.co/vinai/bertweet-base) as a `fill-mask` model. We chose to mask 10% of the words because this resulted in the
smallest possible average cosine distance between the tweets and their augmentations (~0.08). This makes augmentation itself a regularizing
factor during pre-classification fine-tuning, guarding against overfitting to the later test data.
During fine-tuning, we formed pairs by matching each tweet with all remaining tweets in the same data split (training, testing, holdout) 
with similar or dissimilar class labels. For the training and testing set during the fine-tuning process, we utilized the augmentations, and for the 
holdout tweets, we used their original text to test the fine-tuning process and the usefulness of the augmentations towards real tweets.
For all pairs, we chose the largest possible set so that both similar and dissimilar pairs are equally represented while covering all tweets 
of the respective data split.
This process created 162,064 pairs for training and 71,812 pairs for testing. An additional 53,560 pairs were used for final evaluation with the
holdout data. Moreover, we used `MEAN` pooling during fine-tuning to enhance the sentence representations.
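
The pairing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the reported pair counts: the helper name `build_balanced_pairs` is hypothetical, the balancing is done by random downsampling of the larger group, and the sketch does not enforce that every tweet of the split remains covered.

```python
import random
from itertools import combinations


def build_balanced_pairs(tweets, labels, seed=0):
    """Hypothetical helper: pair each tweet with all others in one split,
    then downsample so similar and dissimilar pairs are equally represented."""
    similar, dissimilar = [], []
    for i, j in combinations(range(len(tweets)), 2):
        pair = (tweets[i], tweets[j])
        (similar if labels[i] == labels[j] else dissimilar).append(pair)

    rng = random.Random(seed)
    n = min(len(similar), len(dissimilar))
    similar = rng.sample(similar, n)
    dissimilar = rng.sample(dissimilar, n)
    # Label 1 marks a same-class pair, label 0 a different-class pair.
    return [(a, b, 1) for a, b in similar] + [(a, b, 0) for a, b in dissimilar]


split = ["t1", "t2", "t3", "t4"]
classes = ["Reason", "Reason", "Statement", "None"]
pairs = build_balanced_pairs(split, classes)
print(len(pairs), sum(label for _, _, label in pairs))
```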

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 5065 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters:

  ```
  {'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True}
  ```
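
To make the loss configuration concrete, here is a small pure-Python sketch of a contrastive loss over cosine distance with the margin of 0.5 listed above. It assumes the common formulation `0.5 * (y * d^2 + (1 - y) * max(0, margin - d)^2)` with `d = 1 - cosine_similarity`; the function names are illustrative, and the library implementation operates on batched tensors and averages over the batch.

```python
import math


def cosine_distance(u, v):
    """Cosine distance d = 1 - cos(u, v) for two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(u, v, label, margin=0.5):
    """Per-pair loss: pull similar pairs (label 1) together and push
    dissimilar pairs (label 0) apart until they are at least `margin` away."""
    d = cosine_distance(u, v)
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)


a, b = [1.0, 0.0], [0.0, 1.0]
print(contrastive_loss(a, a, label=1))  # similar pair, identical vectors
print(contrastive_loss(a, b, label=0))  # dissimilar pair, orthogonal vectors
print(contrastive_loss(a, a, label=0))  # dissimilar pair, identical vectors
```

Only the last case is penalized: a dissimilar pair whose embeddings coincide lies inside the margin, whereas the first two pairs already satisfy the objective.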

Parameters of the fit()-Method:

```
{
    "epochs": 5,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.BinaryClassificationEvaluator.BinaryClassificationEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 4e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 2533,
    "weight_decay": 0.01
}
```
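
The reported numbers are consistent with each other, as the short check below shows. It derives the DataLoader length from the number of training pairs and the batch size, and compares the warmup steps with the total number of optimization steps. The 10% warmup ratio is an inference from these numbers, not something stated in the configuration.

```python
import math

train_pairs = 162_064  # training pairs reported in the Training section
batch_size = 32        # DataLoader batch size
epochs = 5             # epochs passed to fit()
warmup_steps = 2533    # warmup steps passed to fit()

steps_per_epoch = math.ceil(train_pairs / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)                         # matches the DataLoader length
print(total_steps)
print(round(warmup_steps / total_steps, 4))    # share of steps spent in warmup
```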

## Evaluation Results

We optimized several BERTweet models with `CLS` or `MEAN` pooling and evaluated them with SBERT's `BinaryClassificationEvaluator`,
using standard `CLS` tokens for classification. The results are:


| Model                                   | Precision | Recall  | F1     | Support |
|-----------------------------------------|-----------|---------|--------|---------|
| Vanilla BERTweet-`CLS`                  | 50.00%    | 100.00% | 66.67% | 53,560  |
| Augmented BERTweet-`CLS`                | 65.69%    | 86.66%  | 74.73% | 53,560  |
| WRAPresentations-`CLS`                  | 66.00%    | 84.32%  | 74.04% | 53,560  |
| WRAPresentations-`MEAN` (current model) | 63.05%    | 88.91%  | 73.78% | 53,560  |

The outcomes for WRAPresentations-`MEAN` are influenced by the utilization of `CLS` pooling during testing, while `MEAN` pooling was employed during
fine-tuning. Despite this, employing `MEAN` pooling during the fine-tuning process still improved the `CLS` representation, particularly in terms
of recall. When WRAPresentations-`MEAN` is tested with `MEAN` pooling, the resulting F1 score stands at 74.07%.
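
As a quick sanity check on the table, the F1 column can be recomputed from the precision and recall columns with the usual harmonic mean. The snippet below is only an arithmetic illustration using the values copied from the table; it does not rerun the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1 in percent), copied from the table above.
reported = {
    "Vanilla BERTweet-CLS": (0.5000, 1.0000, 66.67),
    "Augmented BERTweet-CLS": (0.6569, 0.8666, 74.73),
    "WRAPresentations-CLS": (0.6600, 0.8432, 74.04),
    "WRAPresentations-MEAN": (0.6305, 0.8891, 73.78),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: {100 * f1_score(p, r):.2f}% (reported {f1:.2f}%)")
```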

## Full Model Architecture
<div align="center">
  <img src="https://github.com/TomatenMarc/public-images/raw/main/contrastive_siamese_network.svg" alt="Contrastive Siamese Network" width="100%">
</div>

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Environmental Impact

- **Hardware Type:** A100 PCIe 40GB
- **Hours used:** 2h
- **Cloud Provider:** [Google Cloud Platform](https://colab.research.google.com)
- **Compute Region:** [asia-southeast1](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones?hl=en) (Singapore)
- **Carbon Emitted:** 0.21kg CO2

## Licensing

[WRAPresentations](https://huggingface.co/TomatenMarc/WRAPresentations) © 2023 is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).

## Citation

```
@inproceedings{feger-dietze-2024-bertweets,
    title = "{BERT}weet{'}s {TACO} Fiesta: Contrasting Flavors On The Path Of Inference And Information-Driven Argument Mining On {T}witter",
    author = "Feger, Marc  and
              Dietze, Stefan",
    editor = "Duh, Kevin  and
              Gomez, Helena  and
              Bethard, Steven",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-naacl.146",
    doi = "10.18653/v1/2024.findings-naacl.146",
    pages = "2256--2266"
}
```