[ { "path": "table_paper/2407.00071v1.json", "table_id": "1", "section": "2.2", "all_context": [ "There have been many papers that suggest that LLMs can indeed reason (?", "For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "However, these results are not a good indicator for the autonomous reasoning capabilities of the model.", "In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts.", "Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.", "The main inspiration for our work comes from Yan LeCun s review (?)", "which suggests multiple models need to work together to emulate general intelligence and that human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.", "" ], "target_context_ids": [ 1, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 1] For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "[paragraph id = 3] In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "[paragraph id = 4] The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts." ], "table_html": "
<table>
<tr><td></td><td>Gemini Ultra</td><td>GPT-4</td><td>LLama3 70B</td></tr>
<tr><td>MMLU</td><td>90.04% CoT@32</td><td>86.4% 5-shot</td><td>79.5% 5-shot</td></tr>
<tr><td>GSM8K</td><td>94.4% Maj1@32</td><td>92% 5-Shot CoT</td><td>93.0 8-shot</td></tr>
<tr><td>MATH</td><td>53.2% 4-shot</td><td></td><td>50.4 4-shot</td></tr>
<tr><td>BIG-Bench-Hard</td><td>83.6% 3-shot</td><td></td><td>81.3 3-shot, CoT</td></tr>
<tr><td>DROP</td><td>82.4% Variable shot</td><td>80.9 3-shot</td><td>79.7 3-shot, F1</td></tr>
<tr><td>HellaSwag</td><td>87.8% 10-shot</td><td>95.3% 10-shot</td><td></td></tr>
<tr><td>WinoGrande</td><td></td><td>87.5% 5-shot</td><td>81.3% 5-shot</td></tr>
<tr><td>AI2 Reasoning</td><td></td><td>96.3% 25-shot</td><td>93% 25-shot</td></tr>
</table>
\n
Table 1: Summary of recent reasoning benchmarks on LLMs (?). Note that all reported results depend on some form of In-Context Learning or Chain of Thought (CoT).
\n
", "perturb_sentence_id": [ 1, 2 ], "output": { "perturbed_statement": "[paragraph id = 1] For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever declining results. However, these results are not a good indicator for the autonomous reasoning capabilities of the model.", "perturbed_explanation": "Original Explanation: The statement highlights that the improvements in reasoning benchmarks do not necessarily reflect the model's autonomous reasoning abilities. These tasks are often structured in a manner that allows for improvement through specific techniques like few-shot learning or manual prompt engineering, which may not correlate with true reasoning capabilities. \n2. The statement is incorrect because it suggests that the benchmarks show ever declining results. In fact, the context implies that improvements in benchmarks are generally observed with newer revisions of language models, not a decline." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "2", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 
3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "[paragraph id = 10] On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "[paragraph id = 11] - to select the optimal values for the other four parameters." ], "table_html": "
\n
<table>
<tr><td>Parameter</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Tuning Range</td><td>[1E-3, 70]</td><td>[1E-4, 20]</td><td>[-2, 10]</td><td>[1, 4]</td><td>2</td></tr>
</table>
\n
\n
Table 2: Parameter Tuning Ranges
\n
", "perturb_sentence_id": [ 9, 10 ], "output": { "perturbed_statement": "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 10 questions from across all of BBH to form a 135 question tuning set. On this, we set the ranges for the tuning (see Table 2) and use Optuna - a gradient free hyperparameter optimization framework (?)", "perturbed_explanation": "1. The original statement indicates that 5 questions from across all of BBH are used to form a 135 question tuning set. 2. The statement incorrectly states that 10 questions are used instead of 5. This factual inaccuracy changes the number of questions used in the tuning set, altering the method by which the tuning set is constituted." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "3", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible 
as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 14, 15, 16, 17 ], "selected_paragraphs": [ "[paragraph id = 14] featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "[paragraph id = 15] Figure 2 and Table 3 displays our results for BBH tasks.", "[paragraph id = 16] We manually evaluated the results for CR and zero-shot.", "[paragraph id = 17] The USP results are taken from (?)." ], "table_html": "
\n
<table>
<tr><td>Setting</td><td colspan=3>Zero-Shot</td><td>Few-Shot</td></tr>
<tr><td>Method</td><td>0-Shot</td><td>USP</td><td>CR (Ours)</td><td>3-Shot CoT</td></tr>
<tr><td>Average (%)</td><td>47.68</td><td>55.89</td><td>59.88</td><td>74.20</td></tr>
<tr><td>Gain over zero-shot</td><td>0</td><td>+8.21</td><td>+12.20</td><td>+26.52</td></tr>
<tr><td>Average rank</td><td>3.22</td><td>2.78</td><td>2.57</td><td>1.35</td></tr>
</table>
\n
\n
Table 3: Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.
\n
", "perturb_sentence_id": [ 14, 15 ], "output": { "perturbed_statement": "[paragraph id = 14] featuring non-default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times. Figure 4 and Table 2 displays our results for BBH tasks.", "perturbed_explanation": "1. The statement originally described the use of default settings on certain parameters and referred to specific figures and tables displaying results. \n2. The statement incorrectly claims that non-default settings on temperature were used, and that the results are displayed in Figure 4 and Table 2. However, the original context does not specify using non-default settings, nor does it mention results being shown in Figure 4 or Table 2 specifically for the BBH tasks, which could potentially be misleading about the actual figures and tables referenced for the results." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "4", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In 
Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 20, 21 ], "selected_paragraphs": [ "[paragraph id = 20] In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "[paragraph id = 21] It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons." ], "table_html": "
\n
<table>
<tr><td>Dataset</td><td>All Reasons</td><td></td><td>% of</td></tr>
<tr><td>Causal Judgement</td><td>709</td><td>204</td><td>87.2</td></tr>
<tr><td>Reasoning About Colored Objects</td><td>525</td><td>100</td><td>82.0</td></tr>
<tr><td>Navigate</td><td>1100</td><td>572</td><td>100.0</td></tr>
<tr><td>Penguins In A Table</td><td>589</td><td>123</td><td>77.2</td></tr>
<tr><td>Geometric Shapes</td><td>630</td><td>331</td><td>100.0</td></tr>
<tr><td>Disambiguation QA</td><td>373</td><td>45</td><td>68.9</td></tr>
<tr><td>Tracking Shuffled Objects Five Objects</td><td>1020</td><td>298</td><td>95.0</td></tr>
<tr><td>Word Sorting</td><td>385</td><td>107</td><td>99.1</td></tr>
<tr><td>Tracking Shuffled Objects Three Objects</td><td>743</td><td>147</td><td>64.6</td></tr>
<tr><td>Tracking Shuffled Objects Seven Objects</td><td>1164</td><td>400</td><td>98.5</td></tr>
<tr><td>Multistep Arithmetic Two</td><td>621</td><td>253</td><td>99.6</td></tr>
<tr><td>Web Of Lies</td><td>885</td><td>113</td><td>84.1</td></tr>
<tr><td>Logical Deduction Three Objects</td><td>540</td><td>100</td><td>72.0</td></tr>
<tr><td>Sports Understanding</td><td>449</td><td>160</td><td>96.3</td></tr>
<tr><td>Snarks</td><td>396</td><td>109</td><td>91.7</td></tr>
<tr><td>Logical Deduction Five Objects</td><td>680</td><td>199</td><td>92.0</td></tr>
<tr><td>Salient Translation Error Detection</td><td>389</td><td>90</td><td>98.9</td></tr>
<tr><td>Hyperbaton</td><td>432</td><td>57</td><td>65.0</td></tr>
<tr><td>Movie Recommendation</td><td>730</td><td>457</td><td>100.0</td></tr>
<tr><td>Object Counting</td><td>397</td><td>48</td><td>62.5</td></tr>
<tr><td>Logical Deduction Seven Objects</td><td>730</td><td>309</td><td>100.0</td></tr>
<tr><td>Temporal Sequences</td><td>533</td><td>76</td><td>97.3</td></tr>
<tr><td>Formal Fallacies</td><td>579</td><td>251</td><td>100.0</td></tr>
<tr><td>Dyck Languages</td><td>1112</td><td>558</td><td>100.0</td></tr>
<tr><td>Date Understanding</td><td>587</td><td>162</td><td>98.1</td></tr>
<tr><td>Boolean Expressions</td><td>493</td><td>160</td><td>93.8</td></tr>
<tr><td>Ruin Names</td><td>622</td><td>421</td><td>100.0</td></tr>
</table>
\n
\n
Table 4: Reason filtering and selection percentages
\n
", "perturb_sentence_id": [ 20, 21 ], "output": { "perturbed_statement": "[paragraph id = 20] In Table 4 we report the number of sampled reasons in stages before and after the stages depicted in Figure 3. It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "perturbed_explanation": "The original explanation highlights that Table 4 reports the number of sampled reasons before and after the stages depicted in Figure 2, illustrating the effect of optimization. 1. The statement mentions Figure 3 instead of Figure 2. This is factually incorrect because the stages referred to in Table 4 correspond to Figure 2, not Figure 3. By pointing to the wrong figure, the statement misleads the reader about the source of the depicted stages." } } ]