Problem Solving with Language Models

Community Article · Published February 2, 2025

Introduction

Divide-and-conquer approaches are well known to be effective, and this strategy applies naturally to large language models (LLMs) as well. For example, Chain of Thought (CoT; Wei et al. (2022)) showed that solving problems step by step can lead to substantial performance gains. Building on this success, many works have proposed variations and improvements, such as Tree of Thoughts (ToT; Yao et al. (2023)) and Graph of Thoughts (GoT; Besta et al. (2024)), among many others. In this blog post, we provide a high-level overview of different approaches to solving reasoning problems with language models, focusing on a selection of very recent papers (mainly from 2024) rather than an exhaustive review of all new work.

Self-Correction

A common problem with all these methods is that intermediate steps can contain errors that derail the reasoning chain and lead to wrong conclusions. Many works have attempted to make LLMs correct errors in their own reasoning chains (Shinn et al. 2024; Madaan et al. 2023; Pan et al. 2023; Kim, Baldi, and McAleer 2024; Weng et al. 2023; W. Jiang et al. 2023; Du et al. 2024; Paul et al. 2024; Saunders et al. 2022; Akyurek et al. 2023; Welleck et al. 2023; Estornell et al. 2024). However, Huang et al. (2024) argue that most of these works actually rely on some form of external feedback or gold labels. Since the correction does not stem from logical inference over the model's parametric knowledge and the input and output context, these methods do not perform genuine self-correction; moreover, this reliance prevents their application in scenarios where such external feedback is unavailable. Furthermore, Hong et al. (2024) show that LLMs cannot reliably identify logical fallacies, so LLMs should not yet be expected to self-correct.
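To make the setting concrete, below is a minimal sketch of an intrinsic self-correction loop in the style of Self-Refine (Madaan et al. 2023), where the same model drafts, critiques, and revises its own answer. The `llm` callable, the prompts, and the stopping heuristic are illustrative stand-ins, not any paper's exact implementation; note that this is precisely the no-external-feedback setting in which Huang et al. (2024) observe little or no gain.

```python
from typing import Callable

def self_refine(llm: Callable[[str], str], question: str, max_rounds: int = 3) -> str:
    """Intrinsic self-correction loop in the style of Self-Refine (Madaan et al. 2023).

    The same model drafts an answer, critiques it, and revises it. No external
    feedback or gold labels enter the loop, which is exactly the setting in which
    Huang et al. (2024) find little or no improvement.
    """
    answer = llm(f"Solve step by step:\n{question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Question:\n{question}\n\nProposed solution:\n{answer}\n\n"
            "Check each step for errors. If everything is correct, reply DONE."
        )
        if "DONE" in critique:
            break
        answer = llm(
            f"Question:\n{question}\n\nPrevious solution:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRewrite the solution, fixing the issues."
        )
    return answer
```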

Reasoning-Step Verification

The lack of self-correction abilities in LMs suggests an alternative way to catch mistakes in intermediate steps: pairing an agent specialized in generating reasoning steps with an agent that provides feedback on each step. For example, Paul et al. (2024) propose such a system, for which they train two LMs. The critic is trained on a synthetic dataset containing implausible hypotheses and the feedback needed to fix the reasoning errors. The generator is trained to generate reasoning steps, receive feedback from a frozen critic model, and either correct the current rationale or generate the next one. Although this method shows significant performance gains, there are no experiments on whether a much smaller critic can provide meaningful feedback to a larger generator (for example, a <1B critic paired with a >1B generator), which would make the verification step computationally cheap.
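A rough sketch of such a generator-critic interaction, loosely following REFINER (Paul et al. 2024), might look as follows; the prompts, the one-shot revision heuristic, and the stopping condition are our own simplifications rather than the paper's implementation.

```python
from typing import Callable

def generate_with_critic(
    generator: Callable[[str], str],
    critic: Callable[[str], str],
    question: str,
    max_steps: int = 5,
) -> list[str]:
    """Generator-critic loop loosely following REFINER (Paul et al. 2024):
    the generator proposes one reasoning step at a time, and a frozen critic
    comments on each step before it is committed to the chain."""
    steps: list[str] = []
    for _ in range(max_steps):
        context = f"Question:\n{question}\nSteps so far:\n" + "\n".join(steps)
        step = generator(context + "\nNext step:")
        feedback = critic(f"{context}\nProposed step:\n{step}\nFeedback:")
        if "no error" not in feedback.lower():
            # revise the step once using the critic's feedback (a simplification)
            step = generator(
                f"{context}\nStep:\n{step}\nFeedback:\n{feedback}\nRevised step:"
            )
        steps.append(step)
        if "final answer" in step.lower():  # crude stopping heuristic
            break
    return steps
```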

Task Decomposition

These issues can be exacerbated by challenging tasks such as those requiring long contexts. In these cases, a multi-agent approach that decomposes the task can help. For example, Zhang et al. (2024) decompose a very large input context into k chunks and process each with its own agent. They leverage a communication channel between agents to facilitate understanding of the full context, and a final agent then combines the outputs of all the others to produce the solution. This type of method is very appealing because it requires no training and is task-agnostic, interpretable, and cost-effective.
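As an illustration, such a pipeline can be sketched in a few lines in the spirit of Chain of Agents (Zhang et al. 2024); the character-based chunking, the prompts, and the single running summary used as the communication channel are simplifications of the actual method.

```python
from typing import Callable

def chain_of_agents(
    llm: Callable[[str], str], document: str, question: str, chunk_size: int = 4000
) -> str:
    """Chain-of-Agents-style sketch (Zhang et al. 2024): worker agents read
    successive chunks and pass a running summary (the "communication unit")
    along the chain; a final manager agent produces the answer."""
    chunks = [document[i : i + chunk_size] for i in range(0, len(document), chunk_size)]
    summary = ""
    for chunk in chunks:  # each worker sees one chunk plus the previous summary
        summary = llm(
            f"Question: {question}\nNotes so far: {summary}\n"
            f"New text:\n{chunk}\nUpdate the notes with anything relevant."
        )
    return llm(f"Question: {question}\nNotes:\n{summary}\nGive the final answer.")
```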

Reasoning Tasks May Require Reasoning Templates

Yang et al. (2024) argue that single-query methods such as CoT lack universality because they require examples that illustrate the reasoning process for the specific input task. On the other hand, multi-query methods such as ToT and GoT are significantly more expensive due to the recursive expansion of reasoning paths. To address these limitations, they propose Buffer of Thoughts (BoT). This method maintains a meta-buffer that stores thought templates for solving a wide range of problems. Given an input problem, the system retrieves the most relevant template and instantiates it with the given data. If the task is new, the system falls back to a general template, instantiates it with the problem, distills a new template for such tasks, and updates the meta-buffer with it. This makes BoT cheaper than multi-query methods while yielding better performance, and it requires no training.
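A highly simplified picture of the meta-buffer logic might look as follows; the string-similarity retrieval and the 0.4 threshold are placeholders for the embedding-based retrieval and instantiation that BoT actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable

def solve_with_buffer(
    llm: Callable[[str], str], meta_buffer: dict[str, str], problem: str
) -> str:
    """Buffer-of-Thoughts-style sketch: retrieve the most relevant thought
    template from a meta-buffer and instantiate it with the input problem."""
    task_desc = llm(f"Describe the task type of this problem in one line:\n{problem}")
    scores = {k: SequenceMatcher(None, k, task_desc).ratio() for k in meta_buffer}
    best_key = max(scores, key=scores.get, default=None)
    if best_key is None or scores[best_key] < 0.4:
        # unseen task: derive a new general template and store it for reuse
        template = llm(f"Write a general step-by-step template for tasks like:\n{problem}")
        meta_buffer[task_desc] = template
    else:
        template = meta_buffer[best_key]
    return llm(f"Template:\n{template}\n\nApply it to:\n{problem}")
```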

Improving General Reasoning Performance

To improve the abilities of LLMs to generate step-wise reasoning chains, we can aim to improve either the overall reasoning ability or the generation of each individual step. For the former, Ranaldi and Freitas (2024) propose to improve the distillation of reasoning chains from large LMs into small LMs by adding a second step that applies preference optimization to steer the model toward correct CoT answers. J. C.-Y. Chen et al. (2024) argue that most works focus on forward reasoning, i.e., generating the reasoning steps that lead from a question to its answer, and posit that models, like humans, can benefit from reverse thinking, i.e., starting from an answer and reasoning backward to reconstruct the question. They therefore propose a method to augment reasoning datasets with backward reasoning and perform knowledge distillation with a multi-task loss that teaches the student model to generate both forward and backward reasoning. Trained this way, the student model achieves better performance than when trained on forward reasoning alone.
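The multi-task distillation objective can be sketched as a weighted sum of cross-entropy losses over forward and backward reasoning targets; the `alpha` weighting and the reduction to two tasks here are illustrative assumptions, as the paper combines several objectives.

```python
import torch
import torch.nn.functional as F

def multitask_distillation_loss(
    fwd_logits: torch.Tensor,  # (batch, seq, vocab) student logits on forward targets
    fwd_labels: torch.Tensor,  # (batch, seq) teacher-generated forward rationale tokens
    bwd_logits: torch.Tensor,  # student logits on backward (answer-to-question) targets
    bwd_labels: torch.Tensor,  # teacher-generated backward reasoning tokens
    alpha: float = 0.5,        # task weighting; a free choice in this sketch
) -> torch.Tensor:
    """Multi-task objective in the spirit of J. C.-Y. Chen et al. (2024): train
    the student on both forward and backward reasoning via a weighted sum of
    token-level cross-entropy losses."""
    loss_fwd = F.cross_entropy(fwd_logits.flatten(0, 1), fwd_labels.flatten(), ignore_index=-100)
    loss_bwd = F.cross_entropy(bwd_logits.flatten(0, 1), bwd_labels.flatten(), ignore_index=-100)
    return alpha * loss_fwd + (1.0 - alpha) * loss_bwd
```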

Regarding the generation of each reasoning step, Lai et al. (2024) propose to conduct preference optimization on individual steps rather than on full chains, which makes it possible to provide fine-grained supervision; they also show that this method is data-efficient. Similarly, Xu et al. (2024) propose preference optimization that contrasts correct steps with versions containing subtle injected errors, reducing the generation of small mistakes such as arithmetic slips.
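Concretely, step-level preference optimization keeps the standard DPO objective but compares a correct next step against an erroneous one, conditioned on the prompt plus the shared prefix of correct steps. The sketch below assumes the per-step log-probabilities under the policy and the frozen reference model have already been computed.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(
    logp_win: torch.Tensor,      # policy log-prob of the correct next step
    logp_lose: torch.Tensor,     # policy log-prob of the erroneous next step
    ref_logp_win: torch.Tensor,  # same quantities under the frozen reference model
    ref_logp_lose: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Step-level DPO sketch (after Lai et al. 2024): the standard DPO loss,
    except the preferred/rejected completions are single reasoning steps
    conditioned on the shared prefix of correct steps, not whole chains."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()
```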

Program-Aided LMs

Nevertheless, guaranteeing the correctness of each step remains challenging. Fortunately, it is possible to make the LLM call an external tool, such as a calculator or a code interpreter, that gives deterministic results and correctness guarantees given correct tool inputs (Schick et al. 2023; Lyu et al. 2023; W. Chen et al. 2023; Gao et al. 2023).

One challenge of this approach is extending the set of tools at runtime. Mekala et al. (2024) address this problem by decomposing tool-call generation into two sub-tasks: i) tool selection and ii) parameter generation, each followed by a verification step that reduces the likelihood of selecting the wrong tool. However, Y. Shen, Zhu, and Chen (2024) show that LLMs tend to over-use some tools, and that calibrating tool usage yields performance gains. Another limitation is that most works are constrained to specific scenarios. Shi et al. (2024) address this by proposing Automatic Tool Chain, which enables LLMs to use a chain of tools through code: given the input-output schemas of the tools, the LLM generates a program that calls a chain of tools, parses their outputs, and derives the solution to the problem.

Lastly, augmenting small LMs with tools remains challenging due to their limited capacity. W. Shen et al. (2024) propose a multi-LLM agent framework for tool learning. The framework first fine-tunes an LLM on a dataset for problem-solving with tools; it then makes three copies of this trained LLM and continues fine-tuning each on one of three complementary sub-tasks: i) planner, ii) caller, and iii) summarizer. Tool use is thus conducted through the collaboration of these three LLMs, although this also increases memory costs. Small-scale experiments with LoRA (Hu et al. 2022) show lower performance than full fine-tuning, but the results remain competitive with, or even outperform, the baselines.
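To ground the program-aided idea, here is a minimal PAL-style sketch (Gao et al. 2023) in which the LLM writes code and the interpreter, rather than the model, performs the computation. The prompt and the `answer`-variable convention are illustrative, and executing model-generated code should of course be sandboxed in practice.

```python
from typing import Callable

def pal_solve(llm: Callable[[str], str], question: str) -> object:
    """Program-aided sketch in the spirit of PAL (Gao et al. 2023): the LLM
    writes Python that computes the answer, so the interpreter performs the
    arithmetic deterministically instead of the model.

    WARNING: exec() on untrusted model output is unsafe; real systems sandbox it.
    """
    code = llm(
        f"Write Python that solves the following problem and stores the result "
        f"in a variable named `answer`:\n{question}"
    )
    namespace: dict[str, object] = {}
    exec(code, namespace)  # run in a throwaway namespace; sandbox in practice
    return namespace.get("answer")
```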

Faithfulness of the Reasoning Steps

One important question about all these multi-step reasoning methods is whether the generated chains are faithful to the internal mechanisms of the LLM. Kudo et al. (2024) investigate this question by analyzing whether trained classifiers can predict the values of intermediate variables in arithmetic tasks from the model's hidden states. They show that LLMs arrive at the answer to simple problems before generating a CoT, while for more complex multi-step tasks the model internally settles on the answer only after generating the CoT. Therefore, for simple problems the CoT might not faithfully reflect the internal process that produced the answer, while for more complex problems it might be more faithful.
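A probing setup of this kind can be approximated with a simple linear classifier over hidden states; this sketch only conveys the general idea, and the actual probe design and evaluation protocol in Kudo et al. (2024) differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(
    hidden_states: np.ndarray,    # (n_examples, d_model) activations at a chosen layer/position
    variable_values: np.ndarray,  # (n_examples,) discretized values of an intermediate variable
) -> float:
    """Probing sketch: if a linear classifier can read an intermediate variable's
    value off the hidden states *before* the corresponding CoT step is generated,
    the model has, in some sense, already decided it."""
    n = len(hidden_states)
    split = int(0.8 * n)  # simple train/test split for the probe
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:split], variable_values[:split])
    return float(probe.score(hidden_states[split:], variable_values[split:]))
```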

Inference Costs

Lastly, it is also important to consider the inference costs of these methods, especially since they usually require generating more tokens than regular CoT. Y. Jiang et al. (2024) propose to dynamically decide which Transformer layers to execute based on the importance of each token: uninformative tokens such as articles can be computed cheaply, while important tokens receive the full computation, yielding significant reductions in computation cost. However, skipping layers for some tokens causes KV-cache misses, so they additionally modify the attention masks dynamically to skip the cache for those tokens.
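The token-adaptive idea can be illustrated with a per-token router in front of an expensive sublayer. This toy version routes only a per-token feed-forward block; attention and the KV-cache/mask handling that D-LLM actually addresses are omitted, and the hard threshold would need a differentiable training scheme in practice.

```python
import torch
import torch.nn as nn

class TokenRoutedLayer(nn.Module):
    """Sketch of per-token dynamic layer execution in the spirit of D-LLM
    (Y. Jiang et al. 2024): a tiny router scores each token, and only tokens
    above a threshold pass through the expensive sublayer; the rest are
    copied through unchanged."""

    def __init__(self, layer: nn.Module, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.layer = layer                    # a per-token sublayer, e.g. an MLP
        self.router = nn.Linear(d_model, 1)   # importance score per token
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        keep = torch.sigmoid(self.router(x)).squeeze(-1) > self.threshold  # (batch, seq)
        out = x.clone()
        if keep.any():
            out[keep] = self.layer(x[keep])   # compute only the "important" tokens
        return out

# usage sketch: route a feed-forward block of width d
# d = 512
# mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
# routed = TokenRoutedLayer(mlp, d_model=d)
```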

This brief survey has reviewed recent work on pressing topics in problem-solving with LLMs, including self-correction, step-wise reasoning, tool use, and multi-agent systems.

References

  • Akyurek, Afra Feyza, Ekin Akyurek, Ashwin Kalyan, Peter Clark, Derry Tanti Wijaya, and Niket Tandon. 2023. “RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, 7716–33. Toronto, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.427.

  • Besta, Maciej, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, et al. 2024. “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” Proceedings of the AAAI Conference on Artificial Intelligence 38 (16): 17682–90. https://doi.org/10.1609/aaai.v38i16.29720.

  • Chen, Justin Chih-Yao, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, et al. 2024. “Reverse Thinking Makes LLMs Stronger Reasoners.” https://arxiv.org/abs/2411.19865.

  • Chen, Wenhu, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.” Transactions on Machine Learning Research. https://openreview.net/forum?id=YfZ4ZPt8zd.

  • Du, Yilun, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. “Improving Factuality and Reasoning in Language Models Through Multiagent Debate.” In Proceedings of the 41st International Conference on Machine Learning, edited by Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, 235:11733–63. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v235/du24e.html.

  • Estornell, Andrew, Jean-Francois Ton, Yuanshun Yao, and Yang Liu. 2024. “ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate.” https://arxiv.org/abs/2411.00053.

  • Gao, Luyu, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. “PAL: Program-Aided Language Models.” In Proceedings of the 40th International Conference on Machine Learning. ICML’23. Honolulu, Hawaii, USA: JMLR.org.

  • Hong, Ruixin, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. 2024. “A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning.” In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), edited by Kevin Duh, Helena Gomez, and Steven Bethard, 900–925. Mexico City, Mexico: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.52.

  • Hu, Edward J, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. “LoRA: Low-Rank Adaptation of Large Language Models.” In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9.

  • Huang, Jie, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. “Large Language Models Cannot Self-Correct Reasoning Yet.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=IkmD3fKBPQ.

  • Jiang, Weisen, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, and James T. Kwok. 2023. “Forward-Backward Reasoning in Large Language Models for Mathematical Verification.” https://arxiv.org/abs/2308.07758.

  • Jiang, Yikun, Huanyu Wang, Lei Xie, Hanbin Zhao, Chao Zhang, Hui Qian, and John C. S. Lui. 2024. “D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models.” In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=UIOjGTKHQG.

  • Kim, Geunwoo, Pierre Baldi, and Stephen McAleer. 2024. “Language Models Can Solve Computer Tasks.” Advances in Neural Information Processing Systems 36. https://arxiv.org/pdf/2303.17491.

  • Kudo, Keito, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, and Kentaro Inui. 2024. “Think-to-Talk or Talk-to-Think? When LLMs Come up with an Answer in Multi-Step Reasoning.” https://arxiv.org/abs/2412.01113.

  • Lai, Xin, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. 2024. “Step-DPO: Step-Wise Preference Optimization for Long-Chain Reasoning of LLMs.” https://arxiv.org/abs/2406.18629.

  • Lyu, Qing, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. “Faithful Chain-of-Thought Reasoning.” In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, 305–29. Nusa Dua, Bali: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.ijcnlp-main.20.

  • Madaan, Aman, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, et al. 2023. “Self-Refine: Iterative Refinement with Self-Feedback.” In Thirty-Seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=S37hOerQLB.

  • Mekala, Dheeraj, Jason E Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, and Jane Dwivedi-Yu. 2024. “TOOLVERIFIER: Generalization to New Tools via Self-Verification.” In Findings of the Association for Computational Linguistics: EMNLP 2024, edited by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, 5026–41. Miami, Florida, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-emnlp.289.

  • Pan, Liangming, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2023. “Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Self-Correction Strategies.” arXiv Preprint arXiv:2308.03188. https://arxiv.org/abs/2308.03188.

  • Paul, Debjit, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2024. “REFINER: Reasoning Feedback on Intermediate Representations.” In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Yvette Graham and Matthew Purver, 1100–1126. St. Julian’s, Malta: Association for Computational Linguistics. https://aclanthology.org/2024.eacl-long.67.

  • Ranaldi, Leonardo, and Andre Freitas. 2024. “Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, edited by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, 2325–47. Miami, Florida, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.139.

  • Saunders, William, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. “Self-Critiquing Models for Assisting Human Evaluators.” arXiv Preprint arXiv:2206.05802. https://arxiv.org/abs/2206.05802.

  • Schick, Timo, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. “Toolformer: Language Models Can Teach Themselves to Use Tools.” In Thirty-Seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Yacmpz84TH.

  • Shen, Weizhou, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. 2024. “Small LLMs Are Weak Tool Learners: A Multi-LLM Agent.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, edited by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, 16658–80. Miami, Florida, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.929.

  • Shen, Yuanhao, Xiaodan Zhu, and Lei Chen. 2024. “SMARTCAL: An Approach to Self-Aware Tool-Use Evaluation and Calibration.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, edited by Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, 774–89. Miami, Florida, US: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-industry.59.

  • Shi, Zhengliang, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, and Zhaochun Ren. 2024. “Chain of Tools: Large Language Model Is an Automatic Multi-Tool Learner.” https://arxiv.org/abs/2405.16533.

  • Shinn, Noah, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. “Reflexion: Language Agents with Verbal Reinforcement Learning.” Advances in Neural Information Processing Systems 36. https://openreview.net/pdf?id=vAElhFcKW6.

  • Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. “Chain of Thought Prompting Elicits Reasoning in Large Language Models.” In Advances in Neural Information Processing Systems, edited by Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho. https://openreview.net/forum?id=_VjQlMeSB_J.

  • Welleck, Sean, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2023. “Generating Sequences by Learning to Self-Correct.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=hH36JeQZDaO.

  • Weng, Yixuan, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. 2023. “Large Language Models Are Better Reasoners with Self-Verification.” In Findings of the Association for Computational Linguistics: EMNLP 2023, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 2550–75. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.167.

  • Xu, Kaishuai, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, and Wenjie Li. 2024. “Subtle Errors Matter: Preference Learning via Error-Injected Self-Editing.” arXiv Preprint arXiv:2410.06638. https://arxiv.org/abs/2410.06638.

  • Yang, Ling, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, and Bin Cui. 2024. “Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models.” In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ANO1i9JPtb.

  • Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” In Thirty-Seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h.

  • Zhang, Yusen, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan O Arik. 2024. “Chain of Agents: Large Language Models Collaborating on Long-Context Tasks.” In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=LuCLf4BJsr.
