---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
license: cc-by-sa-4.0
language:
- en
widget:
- source_sentence: "The formula: Not everyone who voted Leave is racist. But everyone who's racist voted Leave. Not everyone who voted Leave is thick. But everyone who's thick voted Leave. The thick racists therefore called the shots, whatever the thoughts of the minority of others. #thick #Brexit"
sentences:
- "Men shouldn’t be making laws about women’s bodies #abortion #Texas"
- "Opinion: As the draconian (and then some) abortion law takes effecting #Texas, this is not an idle question for millions of Americans. A slippery slope towards more like-minded Republican state-legislatures to try to follow suit. #abortion #F24 HTTPURL"
- "’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU citizenship after #Brexit HTTPURL via @USER"
- "@USER Blah blah blah blah blah blah"
example_title: "Reason"
- source_sentence: "This is NOT good for children."
sentences:
- "Men shouldn’t be making laws about women’s bodies #abortion #Texas"
- "Opinion: As the draconian (and then some) abortion law takes effecting #Texas, this is not an idle question for millions of Americans. A slippery slope towards more like-minded Republican state-legislatures to try to follow suit. #abortion #F24 HTTPURL"
- "’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU citizenship after #Brexit HTTPURL via @USER"
- "@USER Blah blah blah blah blah blah"
example_title: "Statement"
- source_sentence: "Elon Musk ready with 'Plan B' if Twitter rejects his offer Read @USER Story | HTTPURL #ElonMusk #ElonMuskTwitter #TwitterTakeover HTTPURL"
sentences:
- "Men shouldn’t be making laws about women’s bodies #abortion #Texas"
- "Opinion: As the draconian (and then some) abortion law takes effecting #Texas, this is not an idle question for millions of Americans. A slippery slope towards more like-minded Republican state-legislatures to try to follow suit. #abortion #F24 HTTPURL"
- "’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU citizenship after #Brexit HTTPURL via @USER"
- "@USER Blah blah blah blah blah blah"
example_title: "Notification"
- source_sentence: "@USER 👅is the Key 😂"
sentences:
- "Men shouldn’t be making laws about women’s bodies #abortion #Texas"
- "Opinion: As the draconian (and then some) abortion law takes effecting #Texas, this is not an idle question for millions of Americans. A slippery slope towards more like-minded Republican state-legislatures to try to follow suit. #abortion #F24 HTTPURL"
- "’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU citizenship after #Brexit HTTPURL via @USER"
- "@USER Blah blah blah blah blah blah"
example_title: "None"
---
# WRAPresentations
WRAPresentations is a [sentence-transformers](https://www.SBERT.net) model that maps tweets into a 768-dimensional dense
vector space according to the four classes Reason, Statement, Notification, and None. The model is tailored for
argument mining on Twitter and is derived from [BERTweet-base](https://huggingface.co/vinai/bertweet-base), which was pre-trained on
Twitter data. Through fine-tuning with the [TACO](https://doi.org/10.5281/zenodo.8030026) dataset, WRAPresentations is effective at Weaving
Relevant Argument Properties (WRAP) into the embedding space.
## Class Semantics
WRAPresentations, to some degree, captures the semantics of the critical components of an argument (inference and information), as defined by the
[Cambridge Dictionary](https://dictionary.cambridge.org).
It encodes *inference* as *a guess that one makes or an opinion formed based on available information*, and *information* as
*facts or details about a person, company, product, etc.*
Consequently, it has also learned the semantics of:
* *Statement*, which refers to unique cases where only the *inference* is presented as *something that someone says or writes officially, or an
action done to express an opinion*.
* *Reason*, which represents a full argument where the *inference* is based on direct *information* mentioned in the tweet, such as a
source-reference or quotation, and thus reveals the author’s motivation *to try to understand and to make judgments based on practical facts*.
* *Notification*, which refers to a tweet that limits itself to providing *information*, such as media channels promoting their latest articles.
* *None*, a tweet that provides neither *inference* nor *information*.
In its entirety, WRAPresentations encodes the following hierarchy for tweets:
<div align="center">
<img src="https://github.com/TomatenMarc/public-images/raw/main/Argument_Tree.svg" alt="Argument Tree" width="100%">
</div>
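Read as a decision rule, the hierarchy above assigns a class from two properties of a tweet: whether it contains an *inference* and whether it contains *information*. A minimal sketch of that rule follows. The function names `wrap_class` and `is_argument` and the boolean inputs are illustrative assumptions and not part of the model's API; the model itself learns these distinctions from data rather than from explicit flags.

```python
def wrap_class(has_inference: bool, has_information: bool) -> str:
    """Map the two argument components to one of the four classes (illustrative only)."""
    if has_inference and has_information:
        return "Reason"        # full argument: inference backed by information
    if has_inference:
        return "Statement"     # inference only
    if has_information:
        return "Notification"  # information only
    return "None"              # neither component


def is_argument(label: str) -> bool:
    """Reason and Statement are the argument classes; the other two are not."""
    return label in {"Reason", "Statement"}


for inference in (True, False):
    for information in (True, False):
        label = wrap_class(inference, information)
        print(f"inference={inference}, information={information} -> {label}, argument={is_argument(label)}")
```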
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model to generate tweet representations like this:
```python
from sentence_transformers import SentenceTransformer
tweets = ["This is an example #tweet", "Each tweet is converted"]
model = SentenceTransformer("TomatenMarc/WRAPresentations")
embeddings = model.encode(tweets)
print(embeddings)
```
<a href="https://github.com/VinAIResearch/BERTweet/blob/master/TweetNormalizer.py">
<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;">
Notice: The tweets need to undergo preprocessing following the specifications for BERTweet-base.
</blockquote>
</a>
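The linked `TweetNormalizer.py` is the reference for this preprocessing. As a rough illustration of what it does to mentions and links, the sketch below masks them with the `@USER` and `HTTPURL` placeholders that also appear in the widget examples of this card. The regular expressions and the name `normalize_tweet` are simplifying assumptions; the original script covers further cases, so prefer it for real use.

```python
import re

# Simplified, hypothetical stand-in for BERTweet's TweetNormalizer:
# only user mentions and links are masked here.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace links with HTTPURL and user mentions with @USER."""
    text = URL_PATTERN.sub("HTTPURL", text)
    text = MENTION_PATTERN.sub("@USER", text)
    # Collapse repeated whitespace left over from the raw tweet.
    return " ".join(text.split())


print(normalize_tweet("@bob read this https://example.com/a  #Brexit"))
# -> @USER read this HTTPURL #Brexit
```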
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model,
then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Tweets we want embeddings for
tweets = ["This is an example #tweet", "Each tweet is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("TomatenMarc/WRAPresentations")
model = AutoModel.from_pretrained("TomatenMarc/WRAPresentations")

# Tokenize sentences
encoded_input = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
```
Furthermore, WRAPresentations can serve as the encoder for `AutoModelForSequenceClassification`, which enables fine-tuning on
tweet classification tasks for the four classes Reason, Statement, Notification, and None. The grouping of Reason and Statement as
argument classes and of Notification and None as non-argument classes is learned implicitly during fine-tuning. This setup facilitates
efficient identification and analysis of argumentative and non-argumentative content in tweets.
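A minimal sketch of how such a classifier could be set up is shown below. The label order in `LABELS` and the helper name `load_classifier` are assumptions made for illustration; this card does not prescribe a label-to-id mapping.

```python
# Assumed label order; adapt it to the encoding used in your own training data.
LABELS = ["Reason", "Statement", "Notification", "None"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def load_classifier(model_name: str = "TomatenMarc/WRAPresentations"):
    """Load the encoder together with a new, untrained 4-way classification head."""
    # Imported lazily so the label mappings can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Calling `load_classifier()` downloads the weights. The classification head starts from random initialization, so it has to be fine-tuned on labeled tweets before its predictions are meaningful.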
## Training
The WRAPresentations model underwent fine-tuning with 1,219 golden tweets from the TACO dataset, covering six topics.
Five topics were chosen for optimization, representing 925 tweets (75.88%) covering #brexit (33.3%), #got (17%), #lotrrop (18.8%), #squidgame (17.1%),
and #twittertakeover (13.8%). The model used a stratified 60/40 split for training/testing on optimization data.
Additionally, 294 golden tweets (24.12%) related to the topic of abortion were chosen as the holdout-set for final evaluation.
During fine-tuning, we formed tweet pairs by matching each tweet with all remaining tweets in the same data split (training, testing, holdout) with
similar or dissimilar class labels. This process created 307,470 pairs for training and 136,530 pairs for testing. An additional 86,142 pairs were
used for final evaluation with the holdout data.
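The stated pair counts can be reproduced with simple arithmetic. The sketch below assumes that the 60/40 split of the 925 optimization tweets yields 555 training and 370 testing tweets, and that each tweet is paired with every other tweet of its split in both orders, giving n * (n - 1) pairs. Neither detail is stated explicitly above, but together they match the reported numbers.

```python
def ordered_pairs(n: int) -> int:
    """Number of (tweet_a, tweet_b) pairs with a != b, counting both orders."""
    return n * (n - 1)


total_tweets = 1219
holdout_tweets = 294                                # abortion topic
optimization_tweets = total_tweets - holdout_tweets  # 925
train_tweets = optimization_tweets * 60 // 100      # 555 (assumed 60% share)
test_tweets = optimization_tweets - train_tweets    # 370

print(ordered_pairs(train_tweets))    # 307470
print(ordered_pairs(test_tweets))     # 136530
print(ordered_pairs(holdout_tweets))  # 86142
```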
The model was trained with the parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 5065 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters:
```
{'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True}
```
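To make the objective concrete, here is a pure-Python sketch of the per-pair loss, assuming the usual contrastive formulation with cosine distance d = 1 - cos(u, v): pairs with the same class are pulled together, and pairs with different classes are pushed apart until their distance exceeds the margin. With `size_average` set to `True`, the per-pair values are averaged over the batch. The function names are illustrative and not the library's API.

```python
import math

MARGIN = 0.5  # taken from the loss parameters above


def cosine_distance(u, v):
    """1 - cosine similarity of two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(distance, label, margin=MARGIN):
    """label = 1 for a pair with the same class, 0 for a pair with different classes."""
    if label == 1:
        return 0.5 * distance ** 2                 # penalize distance between similar tweets
    return 0.5 * max(0.0, margin - distance) ** 2  # penalize dissimilar tweets closer than the margin


d = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal embeddings, distance 1.0
print(contrastive_loss(d, label=0))          # 0.0, already farther apart than the margin
```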
Parameters of the fit()-Method:
```
{
    "epochs": 5,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.BinaryClassificationEvaluator.BinaryClassificationEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 4e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 2533,
    "weight_decay": 0.01
}
```
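The `WarmupLinear` scheduler raises the learning rate linearly from zero during warmup and then decays it linearly back to zero. The sketch below reproduces that shape for the values above. It assumes that the 2,533 warmup steps correspond to 10% of all training steps (5 epochs of 5,065 batches), which is consistent with the numbers but not stated explicitly; the function name `learning_rate` is illustrative.

```python
import math

BASE_LR = 4e-05
STEPS_PER_EPOCH = 5065                           # DataLoader length
EPOCHS = 5
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS           # 25325
WARMUP_STEPS = math.ceil(TOTAL_STEPS / 10)       # assumed 10% warmup -> 2533


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, learning_rate(step))
```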
## Evaluation Results
Following the [standard protocol](https://aclanthology.org/D17-1218.pdf) for cross-topic evaluation in argument mining, we evaluated the
WRAPresentations model with the `BinaryClassificationEvaluator`:
| Model | Accuracy | Precision | Recall | F1 | Support |
|---------------------|----------|-----------|--------|--------|---------|
| vinai/bertweet-base | 60.62% | 50.08% | 99.89% | 66.71% | 86,142 |
| WRAPresentations | 72.25% | 65.45% | 88.21% | 75.14% | 86,142 |
The evaluation used previously unseen data from the holdout topic abortion, on which the model achieved a macro
F1-score of 75.14%. The recall of 88.21% indicates that the model captures subtle tweet patterns and class-specific features for
Reason, Statement, Notification, and None. Precision is lower at 65.45%, since the model prioritizes recall in order to capture
relevant instances. Precision can be improved in a subsequent classification phase, when this model is used
with `AutoModelForSequenceClassification`. In contrast, the baseline model (*vinai/bertweet-base*) achieved a very high recall of 99.89% but a
precision of only 50.08%, which suggests that it over-predicts the positive class rather than separating the classes. WRAPresentations
distinguishes more effectively between tweets of the argument framework, capturing intra-class semantics while discerning inter-class semantics.
This is reflected in its higher F1-score of 75.14%, which shows a better balance between recall and precision.
As a result, WRAPresentations is more suitable for argument mining on Twitter, as it identifies relevant instances in the data
more reliably.
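The F1 values in the table are the harmonic mean of the reported precision and recall. The short check below recomputes them from the table; the helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the table above
results = {
    "vinai/bertweet-base": (0.5008, 0.9989),
    "WRAPresentations": (0.6545, 0.8821),
}

for name, (precision, recall) in results.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2%}")
# vinai/bertweet-base: F1 = 66.71%
# WRAPresentations: F1 = 75.14%
```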
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
## Environmental Impact
- **Hardware Type:** A100 PCIe 40GB
- **Hours used:** 2h
- **Cloud Provider:** [Google Cloud Platform](https://colab.research.google.com)
- **Compute Region:** [asia-southeast1](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones?hl=en) (Singapore)
- **Carbon Emitted:** 0.21kg CO2
## Licensing
[WRAPresentations](https://huggingface.co/TomatenMarc/WRAPresentations) © 2023 by [Marc Feger](mailto:marc.feger@uni-duesseldorf.de) is licensed
under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).
## Contact
Please contact [marc.feger@uni-duesseldorf.de](mailto:marc.feger@uni-duesseldorf.de).