[ { "path": "table_paper/2407.00100v1.json", "table_id": "1", "section": "5.1", "all_context": [ "Table 1 displays the comparison results between IDAICL and four ICL baselines (Vanilla ICL, MetaICL, Channel ICL, and EPR) across GPT-2 models (with 0.8B and 1.5B parameters) and the GPT-Neo model.", "These results lead to three main findings.", "Firstly, IDAICL consistently exhibits high effectiveness across various model sizes and datasets, highlighting its strong generalization capacity, even under scenarios involving imbalanced training data.", "Compared to Vanilla ICL, IDAICL outperforms by an average of 17.7% and 18.4% across diverse datasets and values for GPT-2 with 0.8B and 1.5B parameters, respectively.", "Secondly, in comparison to other ICL baselines like Channel ICL, MetaICL, and EPR, the integration of IDAICL consistently delivers notable performance improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions.", "The inclusion of IDAICL led to an average performance boost of 7.3% for MetaICL and 8.2% for Channel ICL.", "Lastly, IDAICL notably enhances worst-case accuracy and diminishes performance variance across different seeds, showcasing its ability to improve prediction stability.", "Additional results on LLaMA and smaller GPT-2 models are available in Tables 7 and 8 of the Appendix.", "" ], "target_context_ids": [ 0, 2, 3, 4, 5, 6 ], "selected_paragraphs": [ "[paragraph id = 0] Table 1 displays the comparison results between IDAICL and four ICL baselines (Vanilla ICL, MetaICL, Channel ICL, and EPR) across GPT-2 models (with 0.8B and 1.5B parameters) and the GPT-Neo model.", "[paragraph id = 2] Firstly, IDAICL consistently exhibits high effectiveness across various model sizes and datasets, highlighting its strong generalization capacity, even under scenarios involving imbalanced training data.", "[paragraph id = 3] Compared to Vanilla ICL, IDAICL outperforms by an average of 17.7% and 18.4% across diverse datasets and 
values for GPT-2 with 0.8B and 1.5B parameters, respectively.", "[paragraph id = 4] Secondly, in comparison to other ICL baselines like Channel ICL, MetaICL, and EPR, the integration of IDAICL consistently delivers notable performance improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions.", "[paragraph id = 5] The inclusion of IDAICL led to an average performance boost of 7.3% for MetaICL and 8.2% for Channel ICL.", "[paragraph id = 6] Lastly, IDAICL notably enhances worst-case accuracy and diminishes performance variance across different seeds, showcasing its ability to improve prediction stability." ], "table_html": "
\n
PLM | Method | m | SST-2 | SST-5 | MR | CR | Amazon | Subj | TREC | DBPedia | AGNews | CB
GPT-2 0.8B | Vanilla ICL | 4
IDAICL
Vanilla ICL | 8
IDAICL
Vanilla ICL | 12
IDAICL
MetaICL | 12
+IDAICL
Channel ICL | 12
+IDAICL
EPR | 12
+IDAICL
GPT-2 1.5B | Vanilla ICL | 4
IDAICL
Vanilla ICL | 8
IDAICL
Vanilla ICL | 12
IDAICL
MetaICL | 12
+IDAICL
Channel ICL | 12
+IDAICL
EPR | 12
+IDAICL
GPT-Neo | MetaICL | 12
+IDAICL
Channel ICL | 12
+IDAICL
EPR | 12
+IDAICL
\n
\n
Table 1: Comparison results of three PLMs.\nTwo numbers indicate the mean accuracy (%) and standard deviation over different seeds. The best and second-best results per PLM per dataset are highlighted in bold and underlined, respectively. \"+IDAICL\" means that the current approach is used in conjunction with IDAICL. The results for different numbers of demonstration examples (i.e., m values) using the GPT-Neo model are illustrated in Figure 3.\n
\n
", "perturb_sentence_id": [ 3, 4 ], "output": { "perturbed_statement": "[paragraph id = 3] Compared to Vanilla ICL, IDAICL outperforms by an average of 17.7% and 18.4% across diverse datasets and values for GPT-2 with 0.8B and 1.5B parameters, respectively. Secondly, in comparison to other ICL baselines like MetaICL, MetaICL, and EPR, the integration of IDAICL consistently delivers notable performance improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions.", "perturbed_explanation": "Original Explanation: The original statement highlights IDAICL's superior performance over Vanilla ICL by providing specific percentage improvements for both 0.8B and 1.5B parameter models. It further reiterates IDAICL's effectiveness in comparison with other ICL baselines, including Channel ICL, MetaICL, and EPR. 1. The statement incorrectly repeats MetaICL instead of mentioning the correct baseline, Channel ICL, after MetaICL. This error falsely indicates that MetaICL is being compared to itself, misrepresenting the diversity of the ICL baselines used for comparison." 
} }, { "path": "table_paper/2407.00100v1.json", "table_id": "2", "section": "5.2", "all_context": [ "We compared IDAICL with three advanced prediction calibration methods (ConCa, PROCA, and D-ConCa) across three PLMs: GPT-2, GPT-Neo, and LLaMA.", "Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for TREC using the LLaMA model with 33B parameters.", "These findings suggest that IDAICL which leverages statistical information derived from the input data distribution for prediction calibration, generally outperforms methods relying on estimated biases for correction.", "Further comparison results can be found in Table 9 of the Appendix.", "" ], "target_context_ids": [ 1 ], "selected_paragraphs": [ "[paragraph id = 1] Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for TREC using the LLaMA model with 33B parameters." ], "table_html": "
\n
PLM | Method | SST-2 | SST-5 | MR | CR | Subj | TREC | DBPedia | AGNews | CB | Avg.
\n\nLLaMA 13B\nVanilla ICL72.8
ConCa77.0
\nPROCA\n77.9
D-ConCa77.8
IDAICL
\n\nLLaMA 33B\nVanilla ICL76.2
ConCa78.4
\nPROCA\n78.2
D-ConCa78.6
IDAICL
\n
\n
Table 2: Comparison results of Macro-F1 for the LLaMA model with 13B and 33B parameters, with m set to 4.\n
\n
", "perturb_sentence_id": [ 1, 2 ], "output": { "perturbed_statement": "[paragraph id = 1] Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for the IMDB using the LLaMA model with 33B parameters. These findings suggest that IDAICL which leverages statistical information derived from the input data distribution for prediction calibration, generally outperforms methods relying on estimated biases for correction.", "perturbed_explanation": "These findings suggest that IDAICL which leverages statistical information derived from the input data distribution for prediction calibration, generally outperforms methods relying on estimated biases for correction. However, the statement incorrectly says that IDAICL did not achieve state-of-the-art performance for the IMDB dataset, when it actually refers to the TREC dataset. This change alters the specific dataset where IDAICL did not reach state-of-the-art performance, making the statement factually incorrect." 
} }, { "path": "table_paper/2407.00100v1.json", "table_id": "3", "section": "5.5", "all_context": [ "To further investigate the effect of statistical properties within demonstrations on model performance, we exclusively employed queries along with statistical information for inference, excluding the inclusion of demonstrations for each test sample.", "These statistics were estimated using deep features of all training samples.", "As shown in Table 3 , IDAICL relying solely on statistical properties distinctly outperforms Vanilla ICL across scenarios with zero, one, and even four demonstrations.", "This emphasizes the crucial role of prior statistics obtained from training data in PLMs predictions.", "This phenomenon is understandable as statistical properties inherently encompass richer global information compared to individual demonstrations.", "" ], "target_context_ids": [ 0, 2, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 0] To further investigate the effect of statistical properties within demonstrations on model performance, we exclusively employed queries along with statistical information for inference, excluding the inclusion of demonstrations for each test sample.", "[paragraph id = 2] As shown in Table 3 , IDAICL relying solely on statistical properties distinctly outperforms Vanilla ICL across scenarios with zero, one, and even four demonstrations.", "[paragraph id = 3] This emphasizes the crucial role of prior statistics obtained from training data in PLMs predictions.", "[paragraph id = 4] This phenomenon is understandable as statistical properties inherently encompass richer global information compared to individual demonstrations." ], "table_html": "
Dataset | 0-shot | 1-shot | 4-shot | IDAICL
SST-2
SST-5
MR
Subj
\n
Table 3: Accuracy comparison between Vanilla ICL and IDAICL based solely on statistical properties, using the GPT-2 model with 0.8B parameters.\n
\n
", "perturb_sentence_id": [ 2, 3 ], "output": { "perturbed_statement": "[paragraph id = 2] As shown in Table 3 , IDAICL relying solely on statistical properties generally performs worse than Vanilla ICL across scenarios with zero, one, and even four demonstrations. This de-emphasizes the crucial role of prior statistics obtained from training data in PLMs predictions.", "perturbed_explanation": "The original explanation states that relying on statistical properties allows IDAICL to distinctly outperform Vanilla ICL in various scenarios, highlighting the importance of prior statistics in PLM predictions. 1. The statement claims that IDAICL relying solely on statistical properties performs worse than Vanilla ICL, which contradicts the original context that suggests IDAICL actually outperforms Vanilla ICL in those scenarios. 2. The conclusion that this de-emphasizes the role of prior statistics is incorrect because the original context indicates the importance of prior statistics in enhancing performance, not diminishing it." 
} }, { "path": "table_paper/2407.00100v1.json", "table_id": "10", "section": "5.3", "all_context": [ "Previous studies Zhao et al.", "(2021 ); Sorensen et al.", "(2022 ); Min et al.", "(2022a ); Zhang et al.", "(2022b ) have highlighted the considerable variability in ICL's performance.", "In this section, we verified that IDAICL can effectively enhance performance stability across diverse scenarios.", "We have presented the results across different numbers of demonstrations in Table 1.", "For a clearer depiction, the outcomes regarding GPT-Neo are illustrated in Figure 3.", "As the number of demonstration examples (represented by m) increases, both Vanilla ICL and IDAICL exhibit improved performance, emphasizing the importance of comprehensive statistical properties of the input data for IDAICL's effectiveness.", "Notably, IDAICL significantly enhances performance stability across various numbers of demonstrations and consistently outperforms Vanilla ICL.", "The performance improvement is particularly pronounced when m takes on smaller values, indicating the efficacy of IDAICL in enriching the available knowledge for PLMs.", "To confirm that augmenting demonstrations can enhance the robustness of the ICL strategy across various demonstrations, we investigated three distinct demonstration selection settings.", "Setting I: Training samples most similar to the test sample are chosen.", "Setting II: Samples are randomly selected from the training data.", "Setting III: Training samples exhibiting the greatest dissimilarity from the test sample are selected.", "As shown in Figures 4 (a) and (b), IDAICL significantly outperforms Vanilla ICL and demonstrates greater robustness across the three selection settings.", "Additionally, our discoveries suggest that selecting demonstrations that are more similar to the test samples leads to better performance than exclusively selecting dissimilar ones, which aligns with the findings obtained by Wang et al.", "Wang et al.", "(2022 
).", "To assess the performance of IDAICL across various templates, we employed fifteen templates on the SST-2 dataset following those outlined by Zhao et al.", "Zhao et al.", "(2021 ).", "The templates are elaborated in Table 10 of the Appendix.", "Figures 4 (c) and (d) display the performance of Vanilla ICL and IDAICL across six templates.", "Some templates achieve higher average performance than others.", "Nevertheless, IDAICL consistently enhances both average and worst-case accuracy, simultaneously reducing performance variance across different templates.", "The complete results are available in Figure 7 of the Appendix.", "Figures 5 (a) and (b) depict comparison results among Vanilla ICL, MetaICL, Channel ICL, and IDAICL across different degrees of imbalances.", "It is evident that the performance of Vanilla ICL is sensitive to class imbalance, while that of IDAICL and Channel ICL exhibit robustness to the imbalance.", "Moreover, notable performance improvements are observed with higher levels of imbalance.", "Additionally, Figures 5 (c) and (d) illustrate the confusion matrices for CR and Subj datasets, with the proportion of one category (i.e., \"Negative\" and \"Subjective\") in demonstrations setting to 0.1 and 0.2.", "IDAICL significantly improves the accuracy of the underrepresented classes when compared to Vanilla ICL, thereby contributing to enhanced fairness among classes.", "In the subsequent section, we demonstrate that the strong performance of IDAICL in handling imbalanced label distributions stems from both the statistical properties and the class proportion term.", "" ], "target_context_ids": [ 20, 21, 22, 23, 24 ], "selected_paragraphs": [ "[paragraph id = 20] Zhao et al.", "[paragraph id = 21] (2021 ).", "[paragraph id = 22] The templates are elaborated in Table 10 of the Appendix.", "[paragraph id = 23] Figures 4 (c) and (d) display the performance of Vanilla ICL and IDAICL across six templates.", "[paragraph id = 24] Some templates achieve 
higher average performance than others." ], "table_html": "
\n
Format ID | Prompt | Label names
1\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nAnswer: Positive\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nAnswer:\n\n\nPositive / Negative
2\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nAnswer: good\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nAnswer:\n\n\ngood / bad
3\n\n\n\n\n\nMy review for last night’s film: This movie is amazing! The critics agreed that this movie was good\n\n\n\n\nMy review for last night’s film: Horrific movie, don’t see it. The critics agreed that this movie was\n\n\ngood / bad
4\n\n\n\n\n\nHere is what our critics think for this month’s films.\n\n\n\n\nOne of our critics wrote \"This movie is amazing!\". Her sentiment towards the film was positive.\n\n\n\n\nOne of our critics wrote \"Horrific movie, don’t see it\". Her sentiment towards the film was\n\n\npositive / negative
5\n\n\n\n\n\nCritical reception [ edit ]\n\n\n\n\nIn a contemporary review, Roger Ebert wrote \"This movie is amazing!\". Entertainment Weekly agreed, and\nthe overall critical reception of the film was good.\n\n\n\n\nIn a contemporary review, Roger Ebert wrote \"Horrific movie, don’t see it\". Entertainment Weekly agreed, and\nthe overall critical reception of the film was\n\n\ngood / bad
6\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nPositive Review? Yes\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nPositive Review?\n\n\nYes / No
7\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nQuestion: Is the sentiment of the above review Positive or Negative?\n\n\n\n\nAnswer: Positive\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nQuestion: Is the sentiment of the above review Positive or Negative?\n\n\n\n\nAnswer:\n\n\nPositive / Negative
8\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nQuestion: Did the author think that the movie was good or bad?\n\n\n\n\nAnswer: good\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nQuestion: Did the author think that the movie was good or bad?\n\n\n\n\nAnswer:\n\n\ngood / bad
9\n\n\n\n\n\nQuestion: Did the author of the following tweet think that the movie was good or bad?\n\n\n\n\nTweet: This movie is amazing!\n\n\n\n\nAnswer: good\n\n\n\n\nQuestion: Did the author of the following tweet think that the movie was good or bad?\n\n\n\n\nTweet: Horrific movie, don’t see it\n\n\n\n\nAnswer:\n\n\ngood / bad
10\n\n\n\n\n\nThis movie is amazing! My overall feeling was that the movie was good\n\n\n\n\nHorrific movie, don’t see it. My overall feeling was that the movie was\n\n\ngood / bad
11\n\n\n\n\n\nThis movie is amazing! I liked the movie.\n\n\n\n\nHorrific movie, don’t see it. I\n\n\nliked / hated
12\n\n\n\n\n\nThis movie is amazing! My friend asked me if I would give the movie 0 or 5 stars, I said 5\n\n\n\n\nHorrific movie, don’t see it. My friend asked me if I would give the movie 0 or 5 stars, I said\n\n\n0 / 5
13\n\n\n\n\n\nInput: This movie is amazing!\n\n\n\n\nSentiment: Positive\n\n\n\n\nInput: Horrific movie, don’t see it.\n\n\n\n\nSentiment:\n\n\nPositive / Negative
14\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nPositive: True\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nPositive:\n\n\nTrue / False
15\n\n\n\n\n\nReview: This movie is amazing!\n\n\n\n\nStars: 5\n\n\n\n\nReview: Horrific movie, don’t see it.\n\n\n\n\nStars:\n\n\n5 / 0
\n
\n
Table 10: The templates employed for examining the influence of formats on the SST-2 dataset, following those outlined by Zhao et al. (2021). An example from the training data is used for illustration.
\n
", "perturb_sentence_id": [ 21, 22 ], "output": { "perturbed_statement": "[paragraph id = 21] (2021). The templates are presented in Table 5 of the Appendix.", "perturbed_explanation": "1. The original statement indicates that the templates are elaborated in Table 10 of the Appendix. 2. The statement incorrectly mentions Table 5 instead of Table 10, which changes the reference location of the templates in the appendix, making it factually incorrect." } } ]