TomatenMarc committed on
Commit
d22fb13
1 Parent(s): aa7cac0

Upload 19 files

Files changed (1)
  1. README.md +13 -16
README.md CHANGED
@@ -166,14 +166,14 @@ Additionally, 294 golden tweets (24.12%) related to the topic of #abortion were
 Before fine-tuning, we built a copy of the dataset by creating an augmentation of each tweet. The augmentation consisted of replacing all the
 topic words and entities in a tweet, and then randomly masking 10% of the words in a tweet, which were then filled in using
 [BERTweet-base](https://huggingface.co/vinai/bertweet-base) as a `fill-mask` model. We chose to mask 10% of the words because this resulted in the
-smallest possible average cosine distance between the tweets and their augmentations of 0.02, which is close to dissimilarity, making
-augmentation during pre-classification fine-tuning itself a regulating factor prior to any overfitting with the later test data.
+smallest possible average cosine distance of ~0.08 between the tweets and their augmentations, making augmentation during pre-classification
+fine-tuning itself a regularizing factor against overfitting to the later test data.
 During fine-tuning, we formed pairs by matching each tweet with all remaining tweets in the same data split (training, testing, holdout)
 with similar or dissimilar class labels. For the training and testing sets during the fine-tuning process, we utilized the augmentations, and for the
 holdout tweets, we used their original text to test the fine-tuning process and the usefulness of the augmentations on real tweets.
 For all pairs, we chose the largest possible set so that both similar and dissimilar pairs are equally represented while covering all tweets
 of the respective data split.
-This process created 307,470 pairs for training and 136,530 pairs for testing. An additional 86,142 pairs were used for final evaluation with the
+This process created 162,064 pairs for training and 71,812 pairs for testing. An additional 53,560 pairs were used for final evaluation with the
 holdout data. Moreover, we utilized `MEAN` pooling for fine-tuning, which enhances sentence representations.
 
 The model was trained with the parameters:
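To make the augmentation step described in the hunk above concrete, here is a minimal sketch using the Hugging Face `fill-mask` pipeline with BERTweet-base. It only illustrates the random-masking part (the topic-word and entity replacement is omitted), and the helper name and masking loop are illustrative rather than the exact preprocessing code behind the dataset.

```python
import random

from transformers import pipeline

# Sketch of the masking-and-refilling step: mask ~10% of the words in a tweet
# and let BERTweet-base propose a replacement for each masked slot, one at a time.
fill_mask = pipeline("fill-mask", model="vinai/bertweet-base")

def augment(tweet: str, mask_ratio: float = 0.10) -> str:
    tokens = tweet.split()
    n_masked = max(1, int(len(tokens) * mask_ratio))
    for pos in random.sample(range(len(tokens)), n_masked):
        tokens[pos] = fill_mask.tokenizer.mask_token    # e.g. "<mask>"
        best = fill_mask(" ".join(tokens), top_k=1)[0]  # top prediction for the masked slot
        tokens[pos] = best["token_str"]
    return " ".join(tokens)

print(augment("The new policy sparked a heated debate on social media."))
```

Each call refills one masked slot at a time, which keeps the sketch simple; the actual preprocessing may have handled all masked positions jointly.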
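The ~0.08 figure quoted above is an average cosine distance between original tweets and their augmentations. A rough sketch of how such a check could be run with `sentence-transformers` follows; the encoder used for the measurement is an assumption, since the README does not name it.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed encoder for the similarity check; the README does not specify which
# model produced the reported ~0.08 average cosine distance.
encoder = SentenceTransformer("vinai/bertweet-base")

originals = ["Original tweet number one ...", "Original tweet number two ..."]
augmented = ["Augmented tweet number one ...", "Augmented tweet number two ..."]

emb_orig = encoder.encode(originals, convert_to_tensor=True)
emb_aug = encoder.encode(augmented, convert_to_tensor=True)

# Cosine similarity of each tweet with its own augmentation (matrix diagonal),
# turned into an average cosine distance.
pairwise_sim = util.cos_sim(emb_orig, emb_aug).diagonal()
avg_distance = float((1 - pairwise_sim).mean())
print(f"Average cosine distance: {avg_distance:.2f}")
```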
 
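For the pair construction and the `MEAN`-pooling fine-tuning described above, a compressed sketch with `sentence-transformers` could look as follows. The example data, class labels, loss, and hyperparameters are placeholders (the actual values are listed under the training parameters of the card), not the original training script.

```python
import random
from itertools import combinations

from sentence_transformers import InputExample, SentenceTransformer, losses, models
from torch.utils.data import DataLoader

# Illustrative split: (tweet_text, class_label) with the four WRAP classes.
split = [
    ("tweet giving a reason ...", "Reason"),
    ("another reason tweet ...", "Reason"),
    ("tweet stating a claim ...", "Statement"),
    ("another statement tweet ...", "Statement"),
    ("tweet sharing a link ...", "Notification"),
    ("tweet with no argument ...", "None"),
]

# Pair every tweet with every other tweet in the split: label 1.0 for the same
# class ("similar"), 0.0 otherwise, then balance both sides.
similar, dissimilar = [], []
for (t1, c1), (t2, c2) in combinations(split, 2):
    bucket = similar if c1 == c2 else dissimilar
    bucket.append(InputExample(texts=[t1, t2], label=float(c1 == c2)))
n = min(len(similar), len(dissimilar))
pairs = random.sample(similar, n) + random.sample(dissimilar, n)

# BERTweet encoder with a MEAN-pooling head, as used for fine-tuning.
word_embedding = models.Transformer("vinai/bertweet-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

train_dataloader = DataLoader(pairs, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model)  # placeholder pair loss for this sketch
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```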
 
@@ -215,23 +215,20 @@ Parameters of the fit()-Method:
 
 ## Evaluation Results
 
-Following the [standard protocol](https://aclanthology.org/D17-1218.pdf) for cross-topic evaluation for argument mining, we evaluated the
-WRAPresentation model using the `BinaryClassificationEvaluator` of SBERT with standard `CLS` tokens for classification showing:
+We optimized several BERTweet models with `CLS` or `MEAN` pooling and evaluated them using the `BinaryClassificationEvaluator` of SBERT with
+standard `CLS` tokens for classification, showing:
 
 
 | Model                                   | Precision | Recall  | F1     | Support |
 |-----------------------------------------|-----------|---------|--------|---------|
-| Vanilla BERTweet-`CLS`                  | 50.00%    | 100.00% | 66.67% | 86,142  |
-| Augmented BERTweet-`CLS`                | 66.75%    | 84.78%  | 74.69% | 86,142  |
-| WRAPresentations-`CLS`                  | 66.00%    | 84.32%  | 74.04% | 86,142  |
-| WRAPresentations-`MEAN` (current model) | 63.05%    | 88.91%  | 73.78% | 86,142  |
-
-An evaluation was conducted on previously unseen data from the holdout topic #abortion, resulting in the model achieving a passive macro-F1
-score of 73.78% when evaluated with `CLS` tokens and 74.07% F1, when evaluated with `MEAN` pooling as used for fine-tuning.
-The recall, which stands at 88.91%, indicates the model's ability to capture subtle tweet patterns and class-specific features for
-Reason, Statement, Notification, and None. As reference, we report the results for Vanilla BERTweet-`CLS`, which a plain BERTweet-base model, for
-Augmented BERTweet-`CLS`, which was trained on the same augmentations as WRAPresentations-`MEAN` but directly optimizing on the classification task, and
-WRAPresentations-`MEAN`, which is the same model as the presented model but with `CLS` pooling during fine-tuning.
+| Vanilla BERTweet-`CLS`                  | 50.00%    | 100.00% | 66.67% | 53,560  |
+| Augmented BERTweet-`CLS`                | 65.69%    | 86.66%  | 74.73% | 53,560  |
+| WRAPresentations-`CLS`                  | 66.00%    | 84.32%  | 74.04% | 53,560  |
+| WRAPresentations-`MEAN` (current model) | 63.05%    | 88.91%  | 73.78% | 53,560  |
+
+The outcomes for WRAPresentations-`MEAN` are influenced by the use of `CLS` pooling during testing, while `MEAN` pooling was employed during
+fine-tuning. Despite this, employing `MEAN` pooling during the fine-tuning process still improved the `CLS` representation, particularly in terms
+of recall. When WRAPresentations-`MEAN` is tested with `MEAN` pooling, the resulting F1 score stands at 74.07%.
 
 ## Full Model Architecture
 <div align="center">
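Finally, a minimal sketch of the kind of `BinaryClassificationEvaluator` run described in the updated Evaluation Results, assuming the published model id `TomatenMarc/WRAPresentations` and hypothetical holdout pairs; swapping the loaded model's pooling module between `cls` and `mean` would correspond to the pooling comparison discussed above.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Hypothetical holdout pairs from the #abortion topic: two tweet lists plus
# binary labels (1 = same argument class, 0 = different class).
sentences1 = ["first tweet of pair one ...", "first tweet of pair two ..."]
sentences2 = ["second tweet of pair one ...", "second tweet of pair two ..."]
labels = [1, 0]

# Assumed model id for the published checkpoint.
model = SentenceTransformer("TomatenMarc/WRAPresentations")

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="holdout-abortion")
score = evaluator(model)  # best-threshold similarity metrics on the pairs
print(score)
```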