perturb_for_table / table_result /2407.00071v1_output.json
[
{
"path": "table_paper/2407.00071v1.json",
"table_id": "1",
"section": "2.2",
"all_context": [
"There have been many papers that suggest that LLMs can indeed reason (?",
"For each subsequent revision of LLMs - GPT4, Gemini, and Llama3 - reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.",
"However, these results are not a good indicator for the autonomous reasoning capabilities of the model.",
"In each case, the benchmarks are performed using in-context learning, with few-shot (specific exemplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.",
"The latest language models do not report the zero-shot performance on these benchmarks, as seen in Table 1, since the performance is likely poorer than with manual prompts.",
"Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.",
"The main inspiration for our work comes from Yann LeCun's review (?)",
"which suggests multiple models need to work together to emulate general intelligence and that the human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.",
""
],
"target_context_ids": [
1,
3,
4
],
"selected_paragraphs": [
"[paragraph id = 1] For each subsequent revision of LLMs - GPT4, Gemini, and Llama3 - reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.",
"[paragraph id = 3] In each case, the benchmarks are performed using in-context learning, with few-shot (specific exemplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.",
"[paragraph id = 4] The latest language models do not report the zero-shot performance on these benchmarks, as seen in Table 1, since the performance is likely poorer than with manual prompts."
],
"table_html": "<figure class=\"ltx_table\" id=\"S2.T1\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S2.T1.1\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S2.T1.1.1.1\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_tt\" id=\"S2.T1.1.1.1.1\"></th>\n<td class=\"ltx_td ltx_align_right ltx_border_tt\" id=\"S2.T1.1.1.1.2\">Gemini Ultra</td>\n<td class=\"ltx_td ltx_align_right ltx_border_tt\" id=\"S2.T1.1.1.1.3\">GPT-4</td>\n<td class=\"ltx_td ltx_align_right ltx_border_tt\" id=\"S2.T1.1.1.1.4\">LLama3 70B</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.2.2\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S2.T1.1.2.2.1\">MMLU</th>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.2.2.2\">90.04% CoT@32</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.2.2.3\">86.4% 5-shot</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.2.2.4\">79.5% 5-shot</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.3.3\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S2.T1.1.3.3.1\">GSM8K</th>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.3.3.2\">94.4% Maj1@32</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.3.3.3\">92% 5-Shot CoT</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.3.3.4\">93.0 8-shot</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.4.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S2.T1.1.4.4.1\">MATH</th>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.4.4.2\">53.2% 4-shot</td>\n<td class=\"ltx_td ltx_border_t\" id=\"S2.T1.1.4.4.3\"></td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.4.4.4\">50.4 4-shot</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.5.5\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S2.T1.1.5.5.1\">BIG-Bench-Hard</th>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" 
id=\"S2.T1.1.5.5.2\">83.6% 3-shot</td>\n<td class=\"ltx_td ltx_border_t\" id=\"S2.T1.1.5.5.3\"></td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S2.T1.1.5.5.4\">81.3 3-shot, CoT</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.6.6\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S2.T1.1.6.6.1\">DROP</th>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.6.6.2\">82.4% Variable shot</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.6.6.3\">80.9 3-shot</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.6.6.4\">79.7 3-shot,F1</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.7.7\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S2.T1.1.7.7.1\">HellaSwag</th>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.7.7.2\">87.8% 10-shot</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.7.7.3\">95.3% 10-shot</td>\n<td class=\"ltx_td\" id=\"S2.T1.1.7.7.4\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.8.8\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S2.T1.1.8.8.1\">WinoGrande</th>\n<td class=\"ltx_td\" id=\"S2.T1.1.8.8.2\"></td>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.8.8.3\">87.5% 5-shot</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S2.T1.1.8.8.4\">81.3% 5-shot</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S2.T1.1.9.9\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S2.T1.1.9.9.1\">AI2 Reasoning</th>\n<td class=\"ltx_td ltx_border_bb\" id=\"S2.T1.1.9.9.2\"></td>\n<td class=\"ltx_td ltx_align_right ltx_border_bb\" id=\"S2.T1.1.9.9.3\">96.3% 25-shot</td>\n<td class=\"ltx_td ltx_align_right ltx_border_bb\" id=\"S2.T1.1.9.9.4\">93% 25-shot</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span>Summary of recent reasoning benchmarks on LLMs (<span class=\"ltx_text ltx_font_bold\" id=\"S2.T1.3.1\">?</span>). 
Note that reported results are all dependent on some form of In Context Learning or Chain of Thought (CoT) </figcaption>\n</figure>",
"perturb_sentence_id": 1,
"output": {
"perturbed_statement": "[paragraph id = 1] For each revision of LLMs - GPT4 and Gemini show declining results on reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU. Moreover, these results are a strong indicator for the autonomous reasoning capabilities of the model.",
"perturbed_explanation": "The original passage argues that while reasoning benchmarks for subsequent revisions of LLMs like GPT4, Gemini, and Llama3 exhibit increasing performance, such metrics do not effectively measure the models' autonomous reasoning abilities. 1. The statement falsely suggests a decline in benchmark performance, contrary to the asserted improvement. 2. It also incorrectly claims these benchmarks are reliable indicators of reasoning autonomy, contrary to their stated limitations. Together, these changes misrepresent the trends and implications of the benchmarks."
}
},
{
"path": "table_paper/2407.00071v1.json",
"table_id": "2",
"section": "4",
"all_context": [
"We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.",
"This language model is a variant of GPT-3.5-Turbo produced by OpenAI, and was trained with data available until September 2021.",
"We selected the suite of BIG-bench Hard (BBH) tasks - a dataset consisting of reasoning-oriented questions that have proven challenging for LLMs in the past (?).",
"To save on inference time and cost, we sample 50 questions from each of the subtasks (Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks; we sample 50 questions from each of these), combining them into a 1350 question evaluation set without the subset labels to ensure robustness.",
"On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.",
"Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).",
"For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.",
"was determined empirically on test questions.",
"To map to distinct reasons, the similarity threshold is held to =0.90, again determined empirically.",
"Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.",
"On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)",
"- to select the optimal values for the other four parameters.",
"We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.",
"For the Ising solver, we utilized an open-source implementation of simulated annealing (?)",
"featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.",
"Figure 2 and Table 3 display our results for BBH tasks.",
"We manually evaluated the results for CR and zero-shot.",
"The USP results are taken from (?).",
"While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?",
"We performed a human evaluation at each stage of the CR pipeline.",
"In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .",
"It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.",
"More results of the human evaluation can be found in the Appendix.",
""
],
"target_context_ids": [
9,
10,
11
],
"selected_paragraphs": [
"[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.",
"[paragraph id = 10] On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)",
"[paragraph id = 11] - to select the optimal values for the other four parameters."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T2\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S4.T2.5\" style=\"width:433.6pt;height:56.9pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(79.5pt,-10.4pt) scale(1.57928347350645,1.57928347350645) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T2.5.5\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T2.5.5.5\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T2.5.5.5.6\" style=\"padding-bottom:2.15277pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.5.5.5.6.1\">Parameter</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T2.1.1.1.1\" style=\"padding-bottom:2.15277pt;\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T2.2.2.2.2\" style=\"padding-bottom:2.15277pt;\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T2.3.3.3.3\" style=\"padding-bottom:2.15277pt;\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T2.4.4.4.4\" style=\"padding-bottom:2.15277pt;\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T2.5.5.5.5\" style=\"padding-bottom:2.15277pt;\"></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T2.5.5.6.1\">\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t\" id=\"S4.T2.5.5.6.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.5.5.6.1.1.1\">Tuning Range</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t\" id=\"S4.T2.5.5.6.1.2\">[1E-3, 70]</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t\" id=\"S4.T2.5.5.6.1.3\">[1E-4, 20]</td>\n<td class=\"ltx_td 
ltx_align_center ltx_border_bb ltx_border_r ltx_border_t\" id=\"S4.T2.5.5.6.1.4\">[-2, 10]</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t\" id=\"S4.T2.5.5.6.1.5\">[1, 4]</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_t\" id=\"S4.T2.5.5.6.1.6\">2</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 2: </span>Parameter Tuning Ranges</figcaption>\n</figure>",
"perturb_sentence_id": 9,
"output": {
"perturbed_statement": "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 10 questions from across all of BBH to form a 135 question tuning set.On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)",
"perturbed_explanation": "The original explanation mentions the use of 5 questions as part of the tuning set, accompanied by clarification regarding its formation and the optimization framework used. Changing the count from 5 to 10 inaccurately increases the number of questions utilized in this context, leading to a factual error."
}
},
{
"path": "table_paper/2407.00071v1.json",
"table_id": "3",
"section": "4",
"all_context": [
"We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.",
"This language model is a variant of GPT-3.5-Turbo produced by OpenAI, and was trained with data available until September 2021.",
"We selected the suite of BIG-bench Hard (BBH) tasks - a dataset consisting of reasoning-oriented questions that have proven challenging for LLMs in the past (?).",
"To save on inference time and cost, we sample 50 questions from each of the subtasks (Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks; we sample 50 questions from each of these), combining them into a 1350 question evaluation set without the subset labels to ensure robustness.",
"On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.",
"Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).",
"For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.",
"was determined empirically on test questions.",
"To map to distinct reasons, the similarity threshold is held to =0.90, again determined empirically.",
"Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.",
"On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)",
"- to select the optimal values for the other four parameters.",
"We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.",
"For the Ising solver, we utilized an open-source implementation of simulated annealing (?)",
"featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.",
"Figure 2 and Table 3 display our results for BBH tasks.",
"We manually evaluated the results for CR and zero-shot.",
"The USP results are taken from (?).",
"While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?",
"We performed a human evaluation at each stage of the CR pipeline.",
"In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .",
"It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.",
"More results of the human evaluation can be found in the Appendix.",
""
],
"target_context_ids": [
14,
15,
16,
17
],
"selected_paragraphs": [
"[paragraph id = 14] featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.",
"[paragraph id = 15] Figure 2 and Table 3 displays our results for BBH tasks.",
"[paragraph id = 16] We manually evaluated the results for CR and zero-shot.",
"[paragraph id = 17] The USP results are taken from (?)."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T3\">\n<div class=\"ltx_inline-block ltx_transformed_outer\" id=\"S4.T3.2\" style=\"width:433.6pt;height:162.7pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(72.9pt,-27.3pt) scale(1.50642183704488,1.50642183704488) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T3.2.2\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T3.2.2.3.1\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_tt\" id=\"S4.T3.2.2.3.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.2.2.3.1.1.1\">Setting</span></th>\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T3.2.2.3.1.2\"></th>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T3.2.2.3.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.2.2.3.1.3.1\">Zero-Shot</span></td>\n<td class=\"ltx_td ltx_border_r ltx_border_tt\" id=\"S4.T3.2.2.3.1.4\"></td>\n<td class=\"ltx_td ltx_align_right ltx_border_tt\" id=\"S4.T3.2.2.3.1.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.2.2.3.1.5.1\">Few-Shot</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.2.2.4.2\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T3.2.2.4.2.1\">Method</th>\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t\" id=\"S4.T3.2.2.4.2.2\">0-Shot</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.2.2.4.2.3\">USP</td>\n<td class=\"ltx_td ltx_align_right ltx_border_r ltx_border_t\" id=\"S4.T3.2.2.4.2.4\">CR</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S4.T3.2.2.4.2.5\">3-Shot</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.2.2.5.3\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_r\" id=\"S4.T3.2.2.5.3.1\"></th>\n<th class=\"ltx_td ltx_th ltx_th_row\" id=\"S4.T3.2.2.5.3.2\"></th>\n<td class=\"ltx_td\" id=\"S4.T3.2.2.5.3.3\"></td>\n<td class=\"ltx_td ltx_align_right ltx_border_r\" 
id=\"S4.T3.2.2.5.3.4\">(Ours)</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S4.T3.2.2.5.3.5\">CoT</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.1.1.1\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T3.1.1.1.1\">Average (%) \n</th>\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t\" id=\"S4.T3.1.1.1.2\">47.68</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.1.3\">55.89</td>\n<td class=\"ltx_td ltx_align_right ltx_border_r ltx_border_t\" id=\"S4.T3.1.1.1.4\">59.88</td>\n<td class=\"ltx_td ltx_align_right ltx_border_t\" id=\"S4.T3.1.1.1.5\">74.20</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.2.2.6.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T3.2.2.6.4.1\">Gain over zero-shot</th>\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row\" id=\"S4.T3.2.2.6.4.2\">0</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.2.2.6.4.3\">+8.21</td>\n<td class=\"ltx_td ltx_align_right ltx_border_r\" id=\"S4.T3.2.2.6.4.4\">+12.20</td>\n<td class=\"ltx_td ltx_align_right\" id=\"S4.T3.2.2.6.4.5\">+26.52</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.2.2.2\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb ltx_border_r\" id=\"S4.T3.2.2.2.1\">Average rank \n</th>\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T3.2.2.2.2\">3.22</th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.2.2.2.3\">2.78</td>\n<td class=\"ltx_td ltx_align_right ltx_border_bb ltx_border_r\" id=\"S4.T3.2.2.2.4\">2.57</td>\n<td class=\"ltx_td ltx_align_right ltx_border_bb\" id=\"S4.T3.2.2.2.5\">1.35</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 3: </span>Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.</figcaption>\n</figure>",
"perturb_sentence_id": 14,
"output": {
"perturbed_statement": "[paragraph id = 14] featuring default settings on temperature, exponential annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times. Figure 2 and Table 3 display our results for BBH tasks.",
"perturbed_explanation": "The original text specifies a linear annealing schedule. Changing it to exponential introduces a factual discrepancy; given the unchanged contextual details, the statement specifying an exponential annealing schedule is incorrect."
}
},
{
"path": "table_paper/2407.00071v1.json",
"table_id": "4",
"section": "4",
"all_context": [
"We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.",
"This language model is a variant of GPT-3.5-Turbo produced by OpenAI, and was trained with data available until September 2021.",
"We selected the suite of BIG-bench Hard (BBH) tasks - a dataset consisting of reasoning-oriented questions that have proven challenging for LLMs in the past (?).",
"To save on inference time and cost, we sample 50 questions from each of the subtasks (Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks; we sample 50 questions from each of these), combining them into a 1350 question evaluation set without the subset labels to ensure robustness.",
"On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.",
"Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).",
"For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.",
"was determined empirically on test questions.",
"To map to distinct reasons, the similarity threshold is held to =0.90, again determined empirically.",
"Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.",
"On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)",
"- to select the optimal values for the other four parameters.",
"We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.",
"For the Ising solver, we utilized an open-source implementation of simulated annealing (?)",
"featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.",
"Figure 2 and Table 3 display our results for BBH tasks.",
"We manually evaluated the results for CR and zero-shot.",
"The USP results are taken from (?).",
"While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?",
"We performed a human evaluation at each stage of the CR pipeline.",
"In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .",
"It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.",
"More results of the human evaluation can be found in the Appendix.",
""
],
"target_context_ids": [
20,
21
],
"selected_paragraphs": [
"[paragraph id = 20] In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .",
"[paragraph id = 21] It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T4\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S4.T4.4\" style=\"width:433.6pt;height:611.2pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(31.6pt,-44.6pt) scale(1.1708709263391,1.1708709263391) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T4.4.4\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T4.1.1.1\">\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T4.1.1.1.2\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T4.1.1.1.3\">All Reasons</th>\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_r ltx_border_tt\" id=\"S4.T4.1.1.1.4\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T4.1.1.1.1\">% of \n</th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_r\" id=\"S4.T4.4.4.4.4\">Dataset</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T4.2.2.2.1\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T4.3.3.3.2\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S4.T4.4.4.4.3\"></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.5.1\">\n<td class=\"ltx_td ltx_align_left ltx_border_r ltx_border_t\" id=\"S4.T4.4.4.5.1.1\">Causal Judgement</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T4.4.4.5.1.2\">709</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T4.4.4.5.1.3\">204</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.4.4.5.1.4\">87.2</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.6.2\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.6.2.1\">Reasoning About Colored Objects</td>\n<td 
class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.6.2.2\">525</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.6.2.3\">100</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.6.2.4\">82.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.7.3\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.7.3.1\">Navigate</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.7.3.2\">1100</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.7.3.3\">572</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.7.3.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.8.4\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.8.4.1\">Penguins In A Table</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.8.4.2\">589</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.8.4.3\">123</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.8.4.4\">77.2</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.9.5\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.9.5.1\">Geometric Shapes</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.9.5.2\">630</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.9.5.3\">331</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.9.5.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.10.6\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.10.6.1\">Disambiguation QA</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.10.6.2\">373</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.10.6.3\">45</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.10.6.4\">68.9</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.11.7\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.11.7.1\">Tracking Shuffled Objects Five Objects</td>\n<td class=\"ltx_td ltx_align_center 
ltx_border_r\" id=\"S4.T4.4.4.11.7.2\">1020</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.11.7.3\">298</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.11.7.4\">95.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.12.8\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.12.8.1\">Word Sorting</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.12.8.2\">385</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.12.8.3\">107</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.12.8.4\">99.1</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.13.9\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.13.9.1\">Tracking Shuffled Objects Three Objects</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.13.9.2\">743</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.13.9.3\">147</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.13.9.4\">64.6</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.14.10\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.14.10.1\">Tracking Shuffled Objects Seven Objects</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.14.10.2\">1164</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.14.10.3\">400</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.14.10.4\">98.5</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.15.11\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.15.11.1\">Multistep Arithmetic Two</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.15.11.2\">621</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.15.11.3\">253</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.15.11.4\">99.6</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.16.12\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.16.12.1\">Web Of Lies</td>\n<td class=\"ltx_td 
ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.16.12.2\">885</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.16.12.3\">113</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.16.12.4\">84.1</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.17.13\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.17.13.1\">Logical Deduction Three Objects</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.17.13.2\">540</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.17.13.3\">100</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.17.13.4\">72.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.18.14\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.18.14.1\">Sports Understanding</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.18.14.2\">449</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.18.14.3\">160</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.18.14.4\">96.3</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.19.15\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.19.15.1\">Snarks</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.19.15.2\">396</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.19.15.3\">109</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.19.15.4\">91.7</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.20.16\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.20.16.1\">Logical Deduction Five Objects</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.20.16.2\">680</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.20.16.3\">199</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.20.16.4\">92.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.21.17\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.21.17.1\">Salient Translation Error 
Detection</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.21.17.2\">389</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.21.17.3\">90</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.21.17.4\">98.9</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.22.18\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.22.18.1\">Hyperbaton</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.22.18.2\">432</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.22.18.3\">57</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.22.18.4\">65.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.23.19\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.23.19.1\">Movie Recommendation</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.23.19.2\">730</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.23.19.3\">457</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.23.19.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.24.20\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.24.20.1\">Object Counting</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.24.20.2\">397</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.24.20.3\">48</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.24.20.4\">62.5</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.25.21\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.25.21.1\">Logical Deduction Seven Objects</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.25.21.2\">730</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.25.21.3\">309</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.25.21.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.26.22\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.26.22.1\">Temporal 
Sequences</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.26.22.2\">533</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.26.22.3\">76</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.26.22.4\">97.3</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.27.23\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.27.23.1\">Formal Fallacies</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.27.23.2\">579</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.27.23.3\">251</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.27.23.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.28.24\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.28.24.1\">Dyck Languages</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.28.24.2\">1112</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.28.24.3\">558</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.28.24.4\">100.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.29.25\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.29.25.1\">Date Understanding</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.29.25.2\">587</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.29.25.3\">162</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.29.25.4\">98.1</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.30.26\">\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T4.4.4.30.26.1\">Boolean Expressions</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.30.26.2\">493</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T4.4.4.30.26.3\">160</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.30.26.4\">93.8</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4.31.27\">\n<td class=\"ltx_td ltx_align_left ltx_border_bb ltx_border_r\" id=\"S4.T4.4.4.31.27.1\">Ruin 
Names</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r\" id=\"S4.T4.4.4.31.27.2\">622</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb ltx_border_r\" id=\"S4.T4.4.4.31.27.3\">421</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.4.4.31.27.4\">100.0</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 4: </span>Reason filtering and selection percentages</figcaption>\n</figure>",
"perturb_sentence_id": 20,
"output": {
"perturbed_statement": "[paragraph id = 20] In Table 4 we report the percentage change in reasons between stages depicted in Figure 3. It should be noted that the optimization process increases the variety of reasons in this analysis.",
      "perturbed_explanation": "1. As originally stated, the effect of the optimization process is to reduce the number of distinct reasons to a subset for streamlined analysis. 2. The claim that optimization increases the variety of reasons contradicts this, and the reference to Figure 3 does not match the original statement, which refers to Figure 2."
}
}
]