TomatenMarc committed on
Commit
d22fb13
1 Parent(s): aa7cac0

Upload 19 files

Files changed (1)
  1. README.md +13 -16
README.md CHANGED
@@ -166,14 +166,14 @@ Additionally, 294 golden tweets (24.12%) related to the topic of #abortion were
 Before fine-tuning, we built a copy of the dataset by creating an augmentation of each tweet. The augmentation consisted of replacing all the
 topic words and entities in a tweet, and then randomly masking 10% of the words in a tweet, which were then filled in using
 [BERTweet-base](https://huggingface.co/vinai/bertweet-base) as a `fill-mask` model. We chose to mask 10% of the words because this resulted in the
-smallest possible average cosine distance between the tweets and their augmentations of 0.02, which is close to dissimilarity, making
-augmentation during pre-classification fine-tuning itself a regulating factor prior to any overfitting with the later test data.
+smallest possible average cosine distance of ~0.08 between the tweets and their augmentations, making augmentation during pre-classification
+fine-tuning itself a regularizing factor against overfitting to the later test data.
 During fine-tuning, we formed pairs by matching each tweet with all remaining tweets in the same data split (training, testing, holdout)
 with similar or dissimilar class labels. For the training and testing sets during the fine-tuning process, we utilized the augmentations, and for the
 holdout tweets, we used their original text to test the fine-tuning process and the usefulness of the augmentations on real tweets.
 For all pairs, we chose the largest possible set so that both similar and dissimilar pairs are equally represented while covering all tweets
 of the respective data split.
-This process created 307,470 pairs for training and 136,530 pairs for testing. An additional 86,142 pairs were used for final evaluation with the
+This process created 162,064 pairs for training and 71,812 pairs for testing. An additional 53,560 pairs were used for final evaluation with the
 holdout data. Moreover, we utilized `MEAN` pooling for fine-tuning, which enhances sentence representations.
 
 The model was trained with the parameters:
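To make the augmentation step described in the hunk above concrete, here is a minimal sketch using the Hugging Face `fill-mask` pipeline with BERTweet-base. It only illustrates the random-masking part (the topic-word and entity replacement is omitted), and the helper name and masking loop are illustrative rather than the exact preprocessing code behind the dataset.

```python
import random

from transformers import pipeline

# Sketch of the masking-and-refilling step: mask ~10% of the words in a tweet
# and let BERTweet-base propose a replacement for each masked slot, one at a time.
fill_mask = pipeline("fill-mask", model="vinai/bertweet-base")

def augment(tweet: str, mask_ratio: float = 0.10) -> str:
    tokens = tweet.split()
    n_masked = max(1, int(len(tokens) * mask_ratio))
    for pos in random.sample(range(len(tokens)), n_masked):
        tokens[pos] = fill_mask.tokenizer.mask_token    # e.g. "<mask>"
        best = fill_mask(" ".join(tokens), top_k=1)[0]  # top prediction for the masked slot
        tokens[pos] = best["token_str"]
    return " ".join(tokens)

print(augment("The new policy sparked a heated debate on social media."))
```

Each call refills one masked slot at a time, which keeps the sketch simple; the actual preprocessing may have handled all masked positions jointly.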
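The ~0.08 figure quoted above is an average cosine distance between original tweets and their augmentations. A rough sketch of how such a check could be run with `sentence-transformers` follows; the encoder used for the measurement is an assumption, since the README does not name it.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed encoder for the similarity check; the README does not specify which
# model produced the reported ~0.08 average cosine distance.
encoder = SentenceTransformer("vinai/bertweet-base")

originals = ["Original tweet number one ...", "Original tweet number two ..."]
augmented = ["Augmented tweet number one ...", "Augmented tweet number two ..."]

emb_orig = encoder.encode(originals, convert_to_tensor=True)
emb_aug = encoder.encode(augmented, convert_to_tensor=True)

# Cosine similarity of each tweet with its own augmentation (matrix diagonal),
# turned into an average cosine distance.
pairwise_sim = util.cos_sim(emb_orig, emb_aug).diagonal()
avg_distance = float((1 - pairwise_sim).mean())
print(f"Average cosine distance: {avg_distance:.2f}")
```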
 
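For the pair construction and the `MEAN`-pooling fine-tuning described above, a compressed sketch with `sentence-transformers` could look as follows. The example data, class labels, loss, and hyperparameters are placeholders (the actual values are listed under the training parameters of the card), not the original training script.

```python
import random
from itertools import combinations

from sentence_transformers import InputExample, SentenceTransformer, losses, models
from torch.utils.data import DataLoader

# Illustrative split: (tweet_text, class_label) with the four WRAP classes.
split = [
    ("tweet giving a reason ...", "Reason"),
    ("another reason tweet ...", "Reason"),
    ("tweet stating a claim ...", "Statement"),
    ("another statement tweet ...", "Statement"),
    ("tweet sharing a link ...", "Notification"),
    ("tweet with no argument ...", "None"),
]

# Pair every tweet with every other tweet in the split: label 1.0 for the same
# class ("similar"), 0.0 otherwise, then balance both sides.
similar, dissimilar = [], []
for (t1, c1), (t2, c2) in combinations(split, 2):
    bucket = similar if c1 == c2 else dissimilar
    bucket.append(InputExample(texts=[t1, t2], label=float(c1 == c2)))
n = min(len(similar), len(dissimilar))
pairs = random.sample(similar, n) + random.sample(dissimilar, n)

# BERTweet encoder with a MEAN-pooling head, as used for fine-tuning.
word_embedding = models.Transformer("vinai/bertweet-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

train_dataloader = DataLoader(pairs, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model)  # placeholder pair loss for this sketch
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```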
 
@@ -215,23 +215,20 @@ Parameters of the fit()-Method:
 
 ## Evaluation Results
 
-Following the [standard protocol](https://aclanthology.org/D17-1218.pdf) for cross-topic evaluation for argument mining, we evaluated the
-WRAPresentation model using the `BinaryClassificationEvaluator` of SBERT with standard `CLS` tokens for classification showing:
+We optimized several BERTweet models with `CLS` or `MEAN` pooling and evaluated them using the `BinaryClassificationEvaluator` of SBERT with
+standard `CLS` tokens for classification, showing:
 
 
 | Model                                   | Precision | Recall  | F1     | Support |
 |-----------------------------------------|-----------|---------|--------|---------|
-| Vanilla BERTweet-`CLS`                  | 50.00%    | 100.00% | 66.67% | 86,142  |
-| Augmented BERTweet-`CLS`                | 66.75%    | 84.78%  | 74.69% | 86,142  |
-| WRAPresentations-`CLS`                  | 66.00%    | 84.32%  | 74.04% | 86,142  |
-| WRAPresentations-`MEAN` (current model) | 63.05%    | 88.91%  | 73.78% | 86,142  |
-
-An evaluation was conducted on previously unseen data from the holdout topic #abortion, resulting in the model achieving a passive macro-F1
-score of 73.78% when evaluated with `CLS` tokens and 74.07% F1, when evaluated with `MEAN` pooling as used for fine-tuning.
-The recall, which stands at 88.91%, indicates the model's ability to capture subtle tweet patterns and class-specific features for
-Reason, Statement, Notification, and None. As reference, we report the results for Vanilla BERTweet-`CLS`, which a plain BERTweet-base model, for
-Augmented BERTweet-`CLS`, which was trained on the same augmentations as WRAPresentations-`MEAN` but directly optimizing on the classification task, and
-WRAPresentations-`MEAN`, which is the same model as the presented model but with `CLS` pooling during fine-tuning.
+| Vanilla BERTweet-`CLS`                  | 50.00%    | 100.00% | 66.67% | 53,560  |
+| Augmented BERTweet-`CLS`                | 65.69%    | 86.66%  | 74.73% | 53,560  |
+| WRAPresentations-`CLS`                  | 66.00%    | 84.32%  | 74.04% | 53,560  |
+| WRAPresentations-`MEAN` (current model) | 63.05%    | 88.91%  | 73.78% | 53,560  |
+
+The outcomes for WRAPresentations-`MEAN` are influenced by the use of `CLS` pooling during testing, while `MEAN` pooling was employed during
+fine-tuning. Despite this, employing `MEAN` pooling during the fine-tuning process still improved the `CLS` representation, particularly in terms
+of recall. When WRAPresentations-`MEAN` is tested with `MEAN` pooling, the resulting F1 score stands at 74.07%.
 
 ## Full Model Architecture
 <div align="center">
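Finally, a minimal sketch of the kind of `BinaryClassificationEvaluator` run described in the updated Evaluation Results, assuming the published model id `TomatenMarc/WRAPresentations` and hypothetical holdout pairs; swapping the loaded model's pooling module between `cls` and `mean` would correspond to the pooling comparison discussed above.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Hypothetical holdout pairs from the #abortion topic: two tweet lists plus
# binary labels (1 = same argument class, 0 = different class).
sentences1 = ["first tweet of pair one ...", "first tweet of pair two ..."]
sentences2 = ["second tweet of pair one ...", "second tweet of pair two ..."]
labels = [1, 0]

# Assumed model id for the published checkpoint.
model = SentenceTransformer("TomatenMarc/WRAPresentations")

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="holdout-abortion")
score = evaluator(model)  # best-threshold similarity metrics on the pairs
print(score)
```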