[ { "path": "table_paper/2407.00071v1.json", "table_id": "1", "section": "2.2", "all_context": [ "There have been many papers that suggest that LLMs can indeed reason (?", "For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "However, these results are not a good indicator for the autonomous reasoning capabilities of the model.", "In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts.", "Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.", "The main inspiration for our work comes from Yan LeCun s review (?)", "which suggests multiple models need to work together to emulate general intelligence and that human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.", "" ], "target_context_ids": [ 1, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 1] For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results.", "[paragraph id = 3] In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.", "[paragraph id = 4] The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts." ], "table_html": "
Benchmark | Gemini Ultra | GPT-4 | LLama3 70B
MMLU | 90.04% CoT@32 | 86.4% 5-shot | 79.5% 5-shot
GSM8K | 94.4% Maj1@32 | 92% 5-Shot CoT | 93.0 8-shot
MATH | 53.2% 4-shot | | 50.4 4-shot
BIG-Bench-Hard | 83.6% 3-shot | | 81.3 3-shot, CoT
DROP | 82.4% Variable shot | 80.9 3-shot | 79.7 3-shot, F1
HellaSwag | 87.8% 10-shot | 95.3% 10-shot |
WinoGrande | | 87.5% 5-shot | 81.3% 5-shot
AI2 Reasoning | | 96.3% 25-shot | 93% 25-shot
Table 1: Summary of recent reasoning benchmarks on LLMs (?). Note that reported results are all dependent on some form of In Context Learning or Chain of Thought (CoT)
", "perturb_sentence_id": 1, "output": { "perturbed_statement": "[paragraph id = 1] For each revision of LLMs - GPT4 and Gemini show declining results on reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU. Moreover, these results are a strong indicator for the autonomous reasoning capabilities of the model.", "perturbed_explanation": "The original explanation argues that while reasoning benchmarks for subsequent revisions of LLMs like GPT4, Gemini, and Llama3 exhibit increasing performance, such metrics do not effectively measure the models' autonomous reasoning abilities. 1. The statement falsely suggests a decline in benchmark performance, contrary to the assertion of improvement. 2. It also incorrectly claims these benchmarks are reliable indicators of reasoning autonomy, which is contrary to their stated limitations. By making these statements, the statement misrepresents the trends and implications of these benchmarks." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "2", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at 
each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "[paragraph id = 10] On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "[paragraph id = 11] - to select the optimal values for the other four parameters." ], "table_html": "
Parameter | | | | |
Tuning Range | [1E-3, 70] | [1E-4, 20] | [-2, 10] | [1, 4] | 2
Table 2: Parameter Tuning Ranges
", "perturb_sentence_id": 9, "output": { "perturbed_statement": "[paragraph id = 9] Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 10 questions from across all of BBH to form a 135 question tuning set.On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "perturbed_explanation": "The original explanation mentions the use of 5 questions as part of the tuning set, accompanied by the clarification regarding the formation and optimization framework used. 2) Changing the count from 5 to 10 inaccurately increases the number of questions utilized in this context, leading to a factual error." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "3", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the 
number of distinct reasons to a subset of reasons.", "More results of the human evaluation can be found in the Appendix.", "" ], "target_context_ids": [ 14, 15, 16, 17 ], "selected_paragraphs": [ "[paragraph id = 14] featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "[paragraph id = 15] Figure 2 and Table 3 displays our results for BBH tasks.", "[paragraph id = 16] We manually evaluated the results for CR and zero-shot.", "[paragraph id = 17] The USP results are taken from (?)." ], "table_html": "
Setting | Zero-Shot | Zero-Shot | Zero-Shot | Few-Shot
Method | 0-Shot | USP | CR (Ours) | 3-Shot CoT
Average (%) | 47.68 | 55.89 | 59.88 | 74.20
Gain over zero-shot | 0 | +8.21 | +12.20 | +26.52
Average rank | 3.22 | 2.78 | 2.57 | 1.35
Table 3: Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.
", "perturb_sentence_id": 14, "output": { "perturbed_statement": "[paragraph id = 14] featuring default settings on temperature, exponential annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times. Figure 2 and Table 3 display our results for BBH tasks.", "perturbed_explanation": "The original explanation stated that the annealing schedule type was specified as linear. Introducing the alteration of changing it to exponential introduces a factual discrepancy. Consequently, by the unchanged contextual details, the statement specifying exponential annealing is incorrect." } }, { "path": "table_paper/2407.00071v1.json", "table_id": "4", "section": "4", "all_context": [ "We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens.", "This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.", "We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?).", "To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness.", "On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting.", "Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).", "For the Sampling of Reasons step, we sampled the LLM times at to collect sufficient distinct reasons, and calculate their distribution and correlations matrices.", "was determined empirically on test questions.", "To map to distinct reason, the similarity threshold is held to =0.90, again determined empirically.", "Prior to running the QUBO mapper, we tune the mapping parameters , , , and ( is fixed) using 5 questions from across all of BBH to form a 135 question tuning set.", "On this, we set the ranges for the tuning (see Table 2 ) and use Optuna - a gradient free hyperparameter optimization framework (?)", "- to select the optimal values for the other four parameters.", "We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.", "For the Ising solver, we utilized an open-source implementation of simulated annealing (?)", "featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.", "Figure 2 and Table 3 displays our results for BBH tasks.", "We manually evaluated the results for CR and zero-shot.", "The USP results are taken from (?).", "While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?", "We performed a human evaluation at each stage of the CR pipeline.", "In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons.", "More results of the human evaluation 
can be found in the Appendix.", "" ], "target_context_ids": [ 20, 21 ], "selected_paragraphs": [ "[paragraph id = 20] In Table 4 we report the number of sampled reasons before and after the stages depicted in Figure 2 .", "[paragraph id = 21] It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons." ], "table_html": "
Dataset | All Reasons | | % of
Causal Judgement | 709 | 204 | 87.2
Reasoning About Colored Objects | 525 | 100 | 82.0
Navigate | 1100 | 572 | 100.0
Penguins In A Table | 589 | 123 | 77.2
Geometric Shapes | 630 | 331 | 100.0
Disambiguation QA | 373 | 45 | 68.9
Tracking Shuffled Objects Five Objects | 1020 | 298 | 95.0
Word Sorting | 385 | 107 | 99.1
Tracking Shuffled Objects Three Objects | 743 | 147 | 64.6
Tracking Shuffled Objects Seven Objects | 1164 | 400 | 98.5
Multistep Arithmetic Two | 621 | 253 | 99.6
Web Of Lies | 885 | 113 | 84.1
Logical Deduction Three Objects | 540 | 100 | 72.0
Sports Understanding | 449 | 160 | 96.3
Snarks | 396 | 109 | 91.7
Logical Deduction Five Objects | 680 | 199 | 92.0
Salient Translation Error Detection | 389 | 90 | 98.9
Hyperbaton | 432 | 57 | 65.0
Movie Recommendation | 730 | 457 | 100.0
Object Counting | 397 | 48 | 62.5
Logical Deduction Seven Objects | 730 | 309 | 100.0
Temporal Sequences | 533 | 76 | 97.3
Formal Fallacies | 579 | 251 | 100.0
Dyck Languages | 1112 | 558 | 100.0
Date Understanding | 587 | 162 | 98.1
Boolean Expressions | 493 | 160 | 93.8
Ruin Names | 622 | 421 | 100.0
Table 4: Reason filtering and selection percentages
", "perturb_sentence_id": 20, "output": { "perturbed_statement": "[paragraph id = 20] In Table 4 we report the percentage change in reasons between stages depicted in Figure 3. It should be noted that the optimization process increases the variety of reasons in this analysis.", "perturbed_explanation": "1. The effect of optimization as stated originally is to reduce the number of distinct reasons to a subset for streamlined analysis. 2. The assertion that optimization increases the variety of reasons contradicts this fundamental understanding, and the reference to Figure 3 does not align correctly with the discussed content from Figure 2 as outlined." } } ]