[ { "path": "table_paper/2407.00100v1.json", "table_id": "1", "section": "5.1", "all_context": [ "Table 1 displays the comparison results between IDAICL and four ICL baselines (Vanilla ICL, MetaICL, Channel ICL, and EPR) across GPT-2 models (with 0.8B and 1.5B parameters) and the GPT-Neo model.", "These results lead to three main findings.", "Firstly, IDAICL consistently exhibits high effectiveness across various model sizes and datasets, highlighting its strong generalization capacity, even under scenarios involving imbalanced training data.", "Compared to Vanilla ICL, IDAICL outperforms by an average of 17.7% and 18.4% across diverse datasets and m values for GPT-2 with 0.8B and 1.5B parameters, respectively.", "Secondly, in comparison to other ICL baselines like Channel ICL, MetaICL, and EPR, the integration of IDAICL consistently delivers notable performance improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions.", "The inclusion of IDAICL led to an average performance boost of 7.3% for MetaICL and 8.2% for Channel ICL.", "Lastly, IDAICL notably enhances worst-case accuracy and diminishes performance variance across different seeds, showcasing its ability to improve prediction stability.", "Additional results on LLaMA and smaller GPT-2 models are available in Tables 7 and 8 of the Appendix.", "" ], "target_context_ids": [ 0, 2, 3, 4, 5, 6 ], "selected_paragraphs": [ "[paragraph id = 0] Table 1 displays the comparison results between IDAICL and four ICL baselines (Vanilla ICL, MetaICL, Channel ICL, and EPR) across GPT-2 models (with 0.8B and 1.5B parameters) and the GPT-Neo model.", "[paragraph id = 2] Firstly, IDAICL consistently exhibits high effectiveness across various model sizes and datasets, highlighting its strong generalization capacity, even under scenarios involving imbalanced training data.", "[paragraph id = 3] Compared to Vanilla ICL, IDAICL outperforms by an average of 17.7% and 18.4% across diverse datasets and m values for GPT-2 with 0.8B and 1.5B parameters, respectively.", "[paragraph id = 4] Secondly, in comparison to other ICL baselines like Channel ICL, MetaICL, and EPR, the integration of IDAICL consistently delivers notable performance improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions.", "[paragraph id = 5] The inclusion of IDAICL led to an average performance boost of 7.3% for MetaICL and 8.2% for Channel ICL.", "[paragraph id = 6] Lastly, IDAICL notably enhances worst-case accuracy and diminishes performance variance across different seeds, showcasing its ability to improve prediction stability." ], "table_html": "
\n
PLM | Method | m | SST-2 | SST-5 | MR | CR | Amazon | Subj | TREC | DBPedia | AGNews | CB
GPT-2 0.8B | Vanilla ICL | 4
GPT-2 0.8B | IDAICL | 4
GPT-2 0.8B | Vanilla ICL | 8
GPT-2 0.8B | IDAICL | 8
GPT-2 0.8B | Vanilla ICL | 12
GPT-2 0.8B | IDAICL | 12
GPT-2 0.8B | MetaICL | 12
GPT-2 0.8B | +IDAICL | 12
GPT-2 0.8B | Channel ICL | 12
GPT-2 0.8B | +IDAICL | 12
GPT-2 0.8B | EPR | 12
GPT-2 0.8B | +IDAICL | 12
GPT-2 1.5B | Vanilla ICL | 4
GPT-2 1.5B | IDAICL | 4
GPT-2 1.5B | Vanilla ICL | 8
GPT-2 1.5B | IDAICL | 8
GPT-2 1.5B | Vanilla ICL | 12
GPT-2 1.5B | IDAICL | 12
GPT-2 1.5B | MetaICL | 12
GPT-2 1.5B | +IDAICL | 12
GPT-2 1.5B | Channel ICL | 12
GPT-2 1.5B | +IDAICL | 12
GPT-2 1.5B | EPR | 12
GPT-2 1.5B | +IDAICL | 12
GPT-Neo | MetaICL | 12
GPT-Neo | +IDAICL | 12
GPT-Neo | Channel ICL | 12
GPT-Neo | +IDAICL | 12
GPT-Neo | EPR | 12
GPT-Neo | +IDAICL | 12
\n
\n
Table 1: Comparison results of three PLMs.\nTwo numbers indicate the mean accuracy (%) and standard deviation over different seeds. The best and second-best results per PLM per dataset are highlighted in bold and underlined, respectively. \"+IDAICL\" means that the current approach is used in conjunction with IDAICL. The results for different numbers of demonstration examples (i.e., values of m) using the GPT-Neo model are illustrated in Figure 3.\n
\n
", "perturb_sentence_id": 3, "output": { "perturbed_statement": "[paragraph id = 3] Compared to Vanilla ICL, IDAICL outperforms by an average of 7.7% and 8.4% across diverse datasets and m values for GPT-2 with 0.8B and 1.5B parameters, respectively. Secondly, in comparison to other ICL baselines like Channel ICL, MetaICL, and EPR, the integration of IDAICL does not consistently deliver notable performance improvements, questioning the efficacy of enhancing demonstrations for refined predictions.", "perturbed_explanation": "1. IDAICL demonstrates superior performance compared to Vanilla ICL, highlighting its contribution toward efficient learning implementation strategies. 2. The original statement emphasizes the consistent enhancement achieved by comparing IDAICL's integration with other ICL baselines. However, the statement erroneously alters the improvement percentages for Vanilla ICL and inaccurately asserts inconsistency in IDAICL's performance improvement against other baselines, contrary to the reliable enhancements observed. Such inconsistencies misrepresent the established findings and undermine the significance of IDAICL's applications invoked."
} }, { "path": "table_paper/2407.00100v1.json", "table_id": "2", "section": "5.2", "all_context": [ "We compared IDAICL with three advanced prediction calibration methods (ConCa, PROCA, and D-ConCa) across three PLMs: GPT-2, GPT-Neo, and LLaMA.", "Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for TREC using the LLaMA model with 33B parameters.", "These findings suggest that IDAICL which leverages statistical information derived from the input data distribution for prediction calibration, generally outperforms methods relying on estimated biases for correction.", "Further comparison results can be found in Table 9 of the Appendix.", "" ], "target_context_ids": [ 1 ], "selected_paragraphs": [ "[paragraph id = 1] Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for TREC using the LLaMA model with 33B parameters." ], "table_html": "
\n
PLM | Method | SST-2 | SST-5 | MR | CR | Subj | TREC | DBPedia | AGNews | CB | Avg.
LLaMA 13B | Vanilla ICL |  |  |  |  |  |  |  |  |  | 72.8
LLaMA 13B | ConCa |  |  |  |  |  |  |  |  |  | 77.0
LLaMA 13B | PROCA |  |  |  |  |  |  |  |  |  | 77.9
LLaMA 13B | D-ConCa |  |  |  |  |  |  |  |  |  | 77.8
LLaMA 13B | IDAICL
LLaMA 33B | Vanilla ICL |  |  |  |  |  |  |  |  |  | 76.2
LLaMA 33B | ConCa |  |  |  |  |  |  |  |  |  | 78.4
LLaMA 33B | PROCA |  |  |  |  |  |  |  |  |  | 78.2
LLaMA 33B | D-ConCa |  |  |  |  |  |  |  |  |  | 78.6
LLaMA 33B | IDAICL
\n
\n
Table 2: Comparison results of Macro-F1 for the LLaMA model with 13B and 33B parameters, with m set to 4.\n
\n
", "perturb_sentence_id": 1, "output": { "perturbed_statement": "[paragraph id = 1] Table 2 presents the comparison results for the LLaMA models, where IDAICL consistently achieves state-of-the-art performance, except for SST-2 using the LLaMA model with 13B parameters. These findings suggest that IDAICL which leverages statistical information derived from external training data for prediction calibration, generally outperforms methods relying on estimated biases for correction.", "perturbed_explanation": "Original Explanation: The findings highlight that IDAICL utilizes statistical information from the input data distribution for effective prediction calibration, leading to superior performance against other methods in most evaluated scenarios. However, this observation does not hold for TREC when using the LLaMA model with 33B parameters as indicated in the statement.\\n1. The assessment specifying SST-2 as the dataset where IDAICL does not achieve optimal performance contradicts the context, as this limitation pertains to TREC with the 33B parameter model rather than SST-2.\\n2. The characterization of statistical insights sourcing from external training data as the foundation of IDAICL's performance improvement is incorrect, as the calibration relies on the distribution of the input data directly, not external data." 
} }, { "path": "table_paper/2407.00100v1.json", "table_id": "3", "section": "5.5", "all_context": [ "To further investigate the effect of statistical properties within demonstrations on model performance, we exclusively employed queries along with statistical information for inference, excluding the inclusion of demonstrations for each test sample.", "These statistics were estimated using deep features of all training samples.", "As shown in Table 3 , IDAICL relying solely on statistical properties distinctly outperforms Vanilla ICL across scenarios with zero, one, and even four demonstrations.", "This emphasizes the crucial role of prior statistics obtained from training data in PLMs predictions.", "This phenomenon is understandable as statistical properties inherently encompass richer global information compared to individual demonstrations.", "" ], "target_context_ids": [ 0, 2, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 0] To further investigate the effect of statistical properties within demonstrations on model performance, we exclusively employed queries along with statistical information for inference, excluding the inclusion of demonstrations for each test sample.", "[paragraph id = 2] As shown in Table 3 , IDAICL relying solely on statistical properties distinctly outperforms Vanilla ICL across scenarios with zero, one, and even four demonstrations.", "[paragraph id = 3] This emphasizes the crucial role of prior statistics obtained from training data in PLMs predictions.", "[paragraph id = 4] This phenomenon is understandable as statistical properties inherently encompass richer global information compared to individual demonstrations." ], "table_html": "
Dataset | 0-shot | 1-shot | 4-shot | IDAICL
SST-2
SST-5
MR
Subj
\n
Table 3: Accuracy comparison between Vanilla ICL and IDAICL based solely on statistical properties, using the GPT-2 model with 0.8B parameters.\n
\n
", "perturb_sentence_id": 2, "output": { "perturbed_statement": "[paragraph id = 2] As shown in Table 3 , IDAICL relying solely on statistical properties performs consistently worse than Vanilla ICL across scenarios with zero, one, and even four demonstrations. This deemphasizes the role of prior statistics obtained from training data in PLMs predictions.", "perturbed_explanation": "Original Explanation: The effectiveness of IDAICL leveraging statistical properties draws its strength from the rich information provided by global statistics, leading to superior performance over Vanilla ICL in diverse scenarios. Altered Explanation: The statement claiming that IDAICL performs worse than Vanilla ICL contradicts the evidence presented, which actually highlights the advantageous performance of IDAICL utilizing statistical properties. Thus, the altered claim is inconsistent with the findings provided." } }, { "path": "table_paper/2407.00100v1.json", "table_id": "10", "section": "5.3", "all_context": [ "Previous studies Zhao et al.", "(2021 ); Sorensen et al.", "(2022 ); Min et al.", "(2022a ); Zhang et al.", "(2022b ) have highlighted the considerable variability in ICL's performance.", "In this section, we verified that IDAICL can effectively enhance performance stability across diverse scenarios.", "We have presented the results across different numbers of demonstrations in Table 1 .", "For a clearer depiction, the outcomes regarding GPT-Neo are illustrated in Figure 3 .", "As the number of demonstration examples (represented by m) increases, both Vanilla ICL and IDAICL exhibit improved performance, emphasizing the importance of comprehensive statistical properties of the input data for IDAICL's effectiveness.", "Notably, IDAICL significantly enhances performance stability across various numbers of demonstrations and consistently outperforms Vanilla ICL.", "The performance improvement is particularly pronounced when m takes on smaller values, indicating the efficacy of
IDAICL in enriching the available knowledge for PLMs.", "To confirm that augmenting demonstrations can enhance the robustness of the ICL strategy across various demonstrations, we investigated three distinct demonstration selection settings.", "Setting I: Training samples most similar to the test sample are chosen.", "Setting II: Samples are randomly selected from the training data.", "Setting III: Training samples exhibiting the greatest dissimilarity from the test sample are selected.", "As shown in Figures 4 (a) and (b), IDAICL significantly outperforms Vanilla ICL and demonstrates greater robustness across the three selection settings.", "Additionally, our discoveries suggest that selecting demonstrations that are more similar to the test samples leads to better performance than exclusively selecting dissimilar ones, which aligns with the findings obtained by Wang et al.", "Wang et al.", "(2022 ).", "To assess the performance of IDAICL across various templates, we employed fifteen templates on the SST-2 dataset following those outlined by Zhao et al.", "Zhao et al.", "(2021 ).", "The templates are elaborated in Table 10 of the Appendix.", "Figures 4 (c) and (d) display the performance of Vanilla ICL and IDAICL across six templates.", "Some templates achieve higher average performance than others.", "Nevertheless, IDAICL consistently enhances both average and worst-case accuracy, simultaneously reducing performance variance across different templates.", "The complete results are available in Figure 7 of the Appendix.", "Figures 5 (a) and (b) depict comparison results among Vanilla ICL, MetaICL, Channel ICL, and IDAICL across different degrees of imbalances.", "It is evident that the performance of Vanilla ICL is sensitive to class imbalance, while that of IDAICL and Channel ICL exhibit robustness to the imbalance.", "Moreover, notable performance improvements are observed with higher levels of imbalance.", "Additionally, Figures 5 (c) and (d) illustrate the 
confusion matrices for CR and Subj datasets, with the proportion of one category (i.e., \"Negative\" and \"Subjective\") in demonstrations setting to 0.1 and 0.2.", "IDAICL significantly improves the accuracy of the underrepresented classes when compared to Vanilla ICL, thereby contributing to enhanced fairness among classes.", "In the subsequent section, we demonstrate that the strong performance of IDAICL in handling imbalanced label distributions stems from both the statistical properties and the class proportion term.", "" ], "target_context_ids": [ 20, 21, 22, 23, 24 ], "selected_paragraphs": [ "[paragraph id = 20] Zhao et al.", "[paragraph id = 21] (2021 ).", "[paragraph id = 22] The templates are elaborated in Table 10 of the Appendix.", "[paragraph id = 23] Figures 4 (c) and (d) display the performance of Vanilla ICL and IDAICL across six templates.", "[paragraph id = 24] Some templates achieve higher average performance than others." ], "table_html": "
\n
Format ID | Prompt | Label names
1 | Review: This movie is amazing!\nAnswer: Positive\nReview: Horrific movie, don’t see it.\nAnswer: | Positive / Negative
2 | Review: This movie is amazing!\nAnswer: good\nReview: Horrific movie, don’t see it.\nAnswer: | good / bad
3 | My review for last night’s film: This movie is amazing! The critics agreed that this movie was good\nMy review for last night’s film: Horrific movie, don’t see it. The critics agreed that this movie was | good / bad
4 | Here is what our critics think for this month’s films.\nOne of our critics wrote \"This movie is amazing!\". Her sentiment towards the film was positive.\nOne of our critics wrote \"Horrific movie, don’t see it\". Her sentiment towards the film was | positive / negative
5 | Critical reception [ edit ]\nIn a contemporary review, Roger Ebert wrote \"This movie is amazing!\". Entertainment Weekly agreed, and the overall critical reception of the film was good.\nIn a contemporary review, Roger Ebert wrote \"Horrific movie, don’t see it\". Entertainment Weekly agreed, and the overall critical reception of the film was | good / bad
6 | Review: This movie is amazing!\nPositive Review? Yes\nReview: Horrific movie, don’t see it.\nPositive Review? | Yes / No
7 | Review: This movie is amazing!\nQuestion: Is the sentiment of the above review Positive or Negative?\nAnswer: Positive\nReview: Horrific movie, don’t see it.\nQuestion: Is the sentiment of the above review Positive or Negative?\nAnswer: | Positive / Negative
8 | Review: This movie is amazing!\nQuestion: Did the author think that the movie was good or bad?\nAnswer: good\nReview: Horrific movie, don’t see it.\nQuestion: Did the author think that the movie was good or bad?\nAnswer: | good / bad
9 | Question: Did the author of the following tweet think that the movie was good or bad?\nTweet: This movie is amazing!\nAnswer: good\nQuestion: Did the author of the following tweet think that the movie was good or bad?\nTweet: Horrific movie, don’t see it\nAnswer: | good / bad
10 | This movie is amazing! My overall feeling was that the movie was good\nHorrific movie, don’t see it. My overall feeling was that the movie was | good / bad
11 | This movie is amazing! I liked the movie.\nHorrific movie, don’t see it. I | liked / hated
12 | This movie is amazing! My friend asked me if I would give the movie 0 or 5 stars, I said 5\nHorrific movie, don’t see it. My friend asked me if I would give the movie 0 or 5 stars, I said | 0 / 5
13 | Input: This movie is amazing!\nSentiment: Positive\nInput: Horrific movie, don’t see it.\nSentiment: | Positive / Negative
14 | Review: This movie is amazing!\nPositive: True\nReview: Horrific movie, don’t see it.\nPositive: | True / False
15 | Review: This movie is amazing!\nStars: 5\nReview: Horrific movie, don’t see it.\nStars: | 5 / 0
\n
\n
Table 10: The templates employed for examining the influence of formats on the SST-2 dataset, following those outlined by Zhao et al. (2021). An example from the training data is used for illustration.
\n
", "perturb_sentence_id": 21, "output": { "perturbed_statement": "[paragraph id = 21] (2021 ).The templates are introduced in Figure 5 of the Appendix.", "perturbed_explanation": "Original Explanation: The original statement correctly notes that the templates are elaborated in Table 10 of the Appendix. 2. The statement is incorrect because it claims the templates are introduced in Figure 5 of the Appendix, which is not substantiated by the context provided, making this assertion factually inaccurate." } } ]