[ { "path": "table_paper/2407.00071v1.json", "table_id": "1", "section": "2.2", "all_context": [ "There have been many papers that suggest that LLMs can indeed reason (?", "For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "However, these results are not a good indicator for the autonomous reasoning capabilities of the model.", "In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts.", "Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.", "The main inspiration for our work comes from Yan LeCun s review (?)", "which suggests multiple models need to work together to emulate general intelligence and that human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.", "" ], "target_context_ids": [ 1, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 1] For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "[paragraph id = 3] In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "[paragraph id = 4] The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts." ], "table_html": "
<table>
<tr><td></td><td>Gemini Ultra</td><td>GPT-4</td><td>LLama3 70B</td></tr>
<tr><td>MMLU</td><td>90.04% CoT@32</td><td>86.4% 5-shot</td><td>79.5% 5-shot</td></tr>
<tr><td>GSM8K</td><td>94.4% Maj1@32</td><td>92% 5-Shot CoT</td><td>93.0 8-shot</td></tr>
<tr><td>MATH</td><td>53.2% 4-shot</td><td></td><td>50.4 4-shot</td></tr>
<tr><td>BIG-Bench-Hard</td><td>83.6% 3-shot</td><td></td><td>81.3 3-shot, CoT</td></tr>
<tr><td>DROP</td><td>82.4% Variable shot</td><td>80.9 3-shot</td><td>79.7 3-shot, F1</td></tr>
<tr><td>HellaSwag</td><td>87.8% 10-shot</td><td>95.3% 10-shot</td><td></td></tr>
<tr><td>WinoGrande</td><td></td><td>87.5% 5-shot</td><td>81.3% 5-shot</td></tr>
<tr><td>AI2 Reasoning</td><td></td><td>96.3% 25-shot</td><td>93% 25-shot</td></tr>
</table>
\n
Table 1: Summary of recent reasoning benchmarks on LLMs (?). Note that all reported results depend on some form of In-Context Learning or Chain of Thought (CoT).
\n
", "perturb_sentence_id": [ 1, 2 ], "output": { "perturbed_statement": "[paragraph id = 1] For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever declining results. However, these results are not a good indicator for the autonomous reasoning capabilities of the model.", "perturbed_explanation": "Original Explanation: The statement highlights that the improvements in reasoning benchmarks do not necessarily reflect the model's autonomous reasoning abilities. These tasks are often structured in a manner that allows for improvement through specific techniques like few-shot learning or manual prompt engineering, which may not correlate with true reasoning capabilities. \n2. The statement is incorrect because it suggests that the benchmarks show ever declining results. In fact, the context implies that improvements in benchmarks are generally observed with newer revisions of language models, not a decline." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "2", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 
3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "[paragraph id = 10] On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "[paragraph id = 11] - to select the optimal values for the other four parameters." ], "table_html": "
\n
<table>
<tr><td>Parameter</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Tuning Range</td><td>[1E-3, 70]</td><td>[1E-4, 20]</td><td>[-2, 10]</td><td>[1, 4]</td><td>2</td></tr>
</table>
\n
\n
Table 2: Parameter Tuning Ranges
\n
", "perturb_sentence_id": [ 9, 10 ], "output": { "perturbed_statement": "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 10 questions from across all of BBH to form a 135 question tuning set. On this, we set the ranges for the tuning (see Table 2) and use Optuna - a gradient free hyperparameter optimization framework (?)", "perturbed_explanation": "1. The original statement indicates that 5 questions from across all of BBH are used to form a 135 question tuning set. 2. The statement incorrectly states that 10 questions are used instead of 5. This factual inaccuracy changes the number of questions used in the tuning set, altering the method by which the tuning set is constituted." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "3", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible 
as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 14, 15, 16, 17 ], "selected_paragraphs": [ "[paragraph id = 14] featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "[paragraph id = 15] Figure 2 and Table 3 displays our results for BBH tasks.", "[paragraph id = 16] We manually evaluated the results for CR and zero-shot.", "[paragraph id = 17] The USP results are taken from (?)." ], "table_html": "
\n
<table>
<tr><td>Setting</td><td colspan=3>Zero-Shot</td><td>Few-Shot</td></tr>
<tr><td>Method</td><td>0-Shot</td><td>USP</td><td>CR (Ours)</td><td>3-Shot CoT</td></tr>
<tr><td>Average (%)</td><td>47.68</td><td>55.89</td><td>59.88</td><td>74.20</td></tr>
<tr><td>Gain over zero-shot</td><td>0</td><td>+8.21</td><td>+12.20</td><td>+26.52</td></tr>
<tr><td>Average rank</td><td>3.22</td><td>2.78</td><td>2.57</td><td>1.35</td></tr>
</table>
\n
\n
Table 3: Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.
\n
", "perturb_sentence_id": [ 14, 15 ], "output": { "perturbed_statement": "[paragraph id = 14] featuring non-default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times. Figure 4 and Table 2 displays our results for BBH tasks.", "perturbed_explanation": "1. The statement originally described the use of default settings on certain parameters and referred to specific figures and tables displaying results. \n2. The statement incorrectly claims that non-default settings on temperature were used, and that the results are displayed in Figure 4 and Table 2. However, the original context does not specify using non-default settings, nor does it mention results being shown in Figure 4 or Table 2 specifically for the BBH tasks, which could potentially be misleading about the actual figures and tables referenced for the results." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "4", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In 
Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 20, 21 ], "selected_paragraphs": [ "[paragraph id = 20] In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "[paragraph id = 21] It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons." ], "table_html": "
\n
<table>
<tr><td>Dataset</td><td>All Reasons</td><td></td><td>% of</td></tr>
<tr><td>Causal Judgement</td><td>709</td><td>204</td><td>87.2</td></tr>
<tr><td>Reasoning About Colored Objects</td><td>525</td><td>100</td><td>82.0</td></tr>
<tr><td>Navigate</td><td>1100</td><td>572</td><td>100.0</td></tr>
<tr><td>Penguins In A Table</td><td>589</td><td>123</td><td>77.2</td></tr>
<tr><td>Geometric Shapes</td><td>630</td><td>331</td><td>100.0</td></tr>
<tr><td>Disambiguation QA</td><td>373</td><td>45</td><td>68.9</td></tr>
<tr><td>Tracking Shuffled Objects Five Objects</td><td>1020</td><td>298</td><td>95.0</td></tr>
<tr><td>Word Sorting</td><td>385</td><td>107</td><td>99.1</td></tr>
<tr><td>Tracking Shuffled Objects Three Objects</td><td>743</td><td>147</td><td>64.6</td></tr>
<tr><td>Tracking Shuffled Objects Seven Objects</td><td>1164</td><td>400</td><td>98.5</td></tr>
<tr><td>Multistep Arithmetic Two</td><td>621</td><td>253</td><td>99.6</td></tr>
<tr><td>Web Of Lies</td><td>885</td><td>113</td><td>84.1</td></tr>
<tr><td>Logical Deduction Three Objects</td><td>540</td><td>100</td><td>72.0</td></tr>
<tr><td>Sports Understanding</td><td>449</td><td>160</td><td>96.3</td></tr>
<tr><td>Snarks</td><td>396</td><td>109</td><td>91.7</td></tr>
<tr><td>Logical Deduction Five Objects</td><td>680</td><td>199</td><td>92.0</td></tr>
<tr><td>Salient Translation Error Detection</td><td>389</td><td>90</td><td>98.9</td></tr>
<tr><td>Hyperbaton</td><td>432</td><td>57</td><td>65.0</td></tr>
<tr><td>Movie Recommendation</td><td>730</td><td>457</td><td>100.0</td></tr>
<tr><td>Object Counting</td><td>397</td><td>48</td><td>62.5</td></tr>
<tr><td>Logical Deduction Seven Objects</td><td>730</td><td>309</td><td>100.0</td></tr>
<tr><td>Temporal Sequences</td><td>533</td><td>76</td><td>97.3</td></tr>
<tr><td>Formal Fallacies</td><td>579</td><td>251</td><td>100.0</td></tr>
<tr><td>Dyck Languages</td><td>1112</td><td>558</td><td>100.0</td></tr>
<tr><td>Date Understanding</td><td>587</td><td>162</td><td>98.1</td></tr>
<tr><td>Boolean Expressions</td><td>493</td><td>160</td><td>93.8</td></tr>
<tr><td>Ruin Names</td><td>622</td><td>421</td><td>100.0</td></tr>
</table>
\n
\n
Table 4: Reason filtering and selection percentages
\n
", "perturb_sentence_id": [ 20, 21 ], "output": { "perturbed_statement": "[paragraph id = 20] In Table 4 we report the number of sampled reasons in stages before and after the stages depicted in Figure 3. It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "perturbed_explanation": "The original explanation highlights that Table 4 reports the number of sampled reasons before and after the stages depicted in Figure 2, illustrating the effect of optimization. 1. The statement mentions Figure 3 instead of Figure 2. This is factually incorrect because the stages referred to in Table 4 correspond to Figure 2, not Figure 3. By pointing to the wrong figure, the statement misleads the reader about the source of the depicted stages." } } ]