---
license: cc-by-nc-4.0
language:
- en
metrics:
- f1
pipeline_tag: text-classification
tags:
- transformers
- argument-mining
- opinion-mining
- information-extraction
- inference-extraction
- Twitter
widget:
- text: "Men shouldn’t be making laws about women’s bodies #abortion #Texas"
  example_title: "Statement"
- text: "’Bitter truth’: EU chief pours cold water on idea of Brits keeping EU citizenship after #Brexit HTTPURL via @USER"
  example_title: "Notification"
- text: "Opinion: As the draconian (and then some) abortion law takes effect in #Texas, this is not an idle question for millions of Americans. A slippery slope towards more like-minded Republican state legislatures to try to follow suit. #abortion #F24 HTTPURL"
  example_title: "Reason"
- text: "@USER Blah blah blah blah blah blah"
  example_title: "None"
- text: "republican men and karens make me sick"
  example_title: "Unlabeled 1"
- text: "No empire lives forever! Historical fact! GodWins! 🙏💪🇺🇲"
  example_title: "Unlabeled 2"
- text: "Further author information regarding registration and visa support letters will be sent to the authors soon. #CIKM2023"
  example_title: "Unlabeled 3"
- text: "Ummmmmm"
  example_title: "Unlabeled 4"
- text: "whoever says that The Last Jedi is a good movie is lying or trolling everyone"
  example_title: "Unlabeled 5"
- text: "I don’t think people realize how big this story is GBI Strategies, the group paid $11M+ by Biden PACs to harvest fraudulent voter registrations in *20 states*, may be the root source of Democrat election rigging @USER may have just exposed their entire fraud machine HTTPURL"
  example_title: "Unlabeled 6"
---

# WRAP -- A Content Management System for Twitter

WRAP is a classification model built upon `AutoModelForSequenceClassification` that assigns tweets to one of the four classes of the
[TACO dataset](https://anonymous.4open.science/r/TACO): Reason, Statement, Notification, and None.
Designed for extracting information and inferences from Twitter data, WRAP builds upon
[WRAPresentations](https://huggingface.co/TomatenMarc/WRAPresentations), from which it takes its name.
WRAPresentations extends the [BERTweet-base](https://huggingface.co/vinai/bertweet-base) architecture, whose pre-training was
continued on augmented tweets using contrastive learning.
## Class Semantics

The TACO framework revolves around the two key elements of an argument, as defined by the [Cambridge Dictionary](https://dictionary.cambridge.org).
It encodes *inference* as *a guess that you make or an opinion that you form based on the information that you have*, and it leverages the
definition of *information* as *facts or details about a person, company, product, etc.*.

Based on whether these two components are present, WRAP distinguishes four classes of tweets:

* *Statement*, which refers to unique cases where only the *inference* is presented as *something that someone says or writes officially, or an action
  done to express an opinion* (see example 1).
* *Reason*, which represents a full argument where the *inference* is based on direct *information* mentioned in the tweet, such as a source reference
  or quotation, and thus reveals the author’s motivation *to try to understand and to make judgments based on practical facts* (see example 3).
* *Notification*, which refers to a tweet that limits itself to providing *information*, such as media channels promoting their latest articles
  (see example 2).
* *None*, which refers to a tweet that provides neither *inference* nor *information* (see example 4).

In its entirety, WRAP classifies tweets according to the following hierarchy:

<div align="center">
<img src="https://github.com/TomatenMarc/public-images/raw/main/Component_Space_WRAP.svg" alt="Component Space" width="100%">
</div>
## Usage

Using this model is straightforward once you have `transformers` installed:

```
pip install -U transformers
```

Then you can use the model to generate tweet classifications like this:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="TomatenMarc/WRAP")
prediction = pipe("Huggingface is awesome")

print(prediction)
```

<a href="https://github.com/TomatenMarc/TACO/blob/main/notebooks/classifier_cv.ipynb">
<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;">
Notice: The tweets need to undergo preprocessing before classification.
</blockquote>
</a>
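The exact preprocessing steps are documented in the linked notebook. As a minimal illustration of the masking convention visible in the widget examples above (`@USER` for mentions, `HTTPURL` for links), a sketch follows; the function name and regular expressions are illustrative assumptions, not the project's actual implementation:

```python
import re

def preprocess_tweet(text: str) -> str:
    """Illustrative masking: replace user mentions with @USER and
    links with HTTPURL, matching the placeholders in the examples above."""
    text = re.sub(r"@\w+", "@USER", text)            # mask user mentions
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # mask URLs
    return text.strip()

print(preprocess_tweet("@elonmusk check this out https://example.com/article"))
# → @USER check this out HTTPURL
```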
## Training

The final model was trained on the entire shuffled ground-truth dataset TACO, encompassing a total of 1,734 tweets.
The dataset covers the following topics: #abortion (25.9%), #brexit (29.0%), #got (11.0%), #lotrrop (12.1%), #squidgame (12.7%), and
#twittertakeover (9.3%). For training, we utilized [SimpleTransformers](https://simpletransformers.ai).

Additionally, the category and class distribution of the TACO dataset is as follows:

| Inference    | No-Inference |
|--------------|--------------|
| 865 (49.88%) | 869 (50.12%) |

| Information   | No-Information |
|---------------|----------------|
| 1081 (62.34%) | 653 (37.66%)   |

| Reason       | Statement    | Notification | None         |
|--------------|--------------|--------------|--------------|
| 581 (33.50%) | 284 (16.38%) | 500 (28.84%) | 369 (21.28%) |

<p>
<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;">
Notice: Our training involved TACO to forecast class predictions, where the categories (Information/Inference) represent class aggregations
based on the inference or information component.
</blockquote>
</p>
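The aggregation described in the notice can be verified directly from the tables above: per the class semantics, Reason and Statement carry an inference, while Reason and Notification carry information. A quick sanity check:

```python
# Class counts taken from the TACO distribution tables above.
counts = {"Reason": 581, "Statement": 284, "Notification": 500, "None": 369}

# Categories aggregate the classes by their inference/information component.
inference = counts["Reason"] + counts["Statement"]       # classes carrying an inference
information = counts["Reason"] + counts["Notification"]  # classes carrying information

print(inference, information, sum(counts.values()))
# → 865 1081 1734, matching the category tables and the dataset size
```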
### Dataloader

```
"data_loader": {
    "type": "torch.utils.data.dataloader.DataLoader",
    "args": {
        "batch_size": 8,
        "sampler": "torch.utils.data.sampler.RandomSampler"
    }
}
```

Parameters of the fit() method:

```
{
    "epochs": 5,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 4e-05
    },
    "scheduler": "WarmupLinear",
    "warmup_steps": 66,
    "weight_decay": 0.06
}
```
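Mapped onto SimpleTransformers-style argument names, the hyperparameters above would look roughly as follows. This is a sketch, not the project's actual training script; the argument names follow SimpleTransformers' classification args, and the commented-out model call is an assumption:

```python
# Hyperparameters from the fit() parameters and DataLoader config above,
# expressed as a SimpleTransformers-style args dict (illustrative mapping).
train_args = {
    "num_train_epochs": 5,
    "learning_rate": 4e-05,
    "train_batch_size": 8,   # batch_size from the DataLoader config
    "max_grad_norm": 1,
    "weight_decay": 0.06,
    "warmup_steps": 66,
}

# With simpletransformers installed, training would look roughly like:
# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel("bertweet", "TomatenMarc/WRAPresentations",
#                             num_labels=4, args=train_args)
# model.train_model(train_df)
print(train_args["num_train_epochs"], train_args["learning_rate"])
```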
## Evaluation

We applied 6-fold (in-topic) cross-validation to demonstrate WRAP's performance, using the dataset and parameters described in the
*Training* section: the model was trained on k-1 splits and evaluated on the k-th split.

Additionally, we assessed its ability to generalize across the 6 topics of TACO (cross-topic): each topic was used once for testing,
while the remaining 5 topics were used for training.
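The cross-topic protocol amounts to leave-one-topic-out cross-validation. A minimal sketch of the split logic (the data structure and topic labels here are illustrative, not the evaluation code itself):

```python
def cross_topic_splits(samples):
    """Leave-one-topic-out: each topic serves once as the test set,
    with all remaining topics forming the training set."""
    topics = sorted({topic for _, topic in samples})
    for held_out in topics:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

# Illustrative toy data: (tweet, topic) pairs.
data = [("t1", "#abortion"), ("t2", "#brexit"), ("t3", "#abortion"), ("t4", "#got")]
for topic, train, test in cross_topic_splits(data):
    print(topic, len(train), len(test))
```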
In total, the WRAP classifier performs as follows:

### Content Management

| Macro-F1    | Inference | Information | Multiclass |
|-------------|-----------|-------------|------------|
| In-Topic    | 87.71%    | 85.34%      | 75.80%     |
| Cross-Topic | 86.71%    | 84.59%      | 73.92%     |

### Classification

| Macro-F1    | Reason | Statement | Notification | None   |
|-------------|--------|-----------|--------------|--------|
| In-Topic    | 77.82% | 61.10%    | 80.56%       | 83.71% |
| Cross-Topic | 76.52% | 58.99%    | 78.43%       | 81.73% |
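The Macro-F1 scores above average the per-class F1 scores, so each class counts equally regardless of its frequency in the data. A stdlib-only sketch of the metric (the toy labels are illustrative):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when TP is 0.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

labels = ["Reason", "Statement", "Notification", "None"]
y_true = ["Reason", "Reason", "Statement", "None"]
y_pred = ["Reason", "Statement", "Statement", "None"]
print(round(macro_f1(y_true, y_pred, labels), 4))
# → 0.5833
```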
## Environmental Impact

- **Hardware Type:** A100 PCIe 40GB
- **Hours Used:** 10 minutes
- **Cloud Provider:** [Google Cloud Platform](https://colab.research.google.com)
- **Compute Region:** [asia-southeast1](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones?hl=en) (Singapore)
- **Carbon Emitted:** 0.02 kg CO2

## Licensing

[WRAP](https://huggingface.co/TomatenMarc/WRAP) © 2023 is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).