From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Abstract
While large language models have made significant strides in code generation, the pass rate of generated code is often bottlenecked by subtle errors that require human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce the Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations on HumanEval and a 97.6% repair success rate on HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
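To make the bottom-up procedure from the abstract concrete, here is a minimal sketch, not the authors' released implementation: the names `SubFunction`, `debug_bottom_up`, and `llm_call` are illustrative, and `llm_call` stands in for whatever chat-completion backend is used. It assumes the code has already been decomposed into a tree of subfunctions and shows how children are repaired before their parent, using LLM-generated tests and an LLM-simulated execution trace as feedback.

```python
# Illustrative sketch of hierarchical, bottom-up debugging (hypothetical helpers;
# not the official MGDebugger code). `llm_call` wraps any text-in/text-out LLM.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    name: str
    code: str                                      # current source of this subfunction
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(node: SubFunction,
                    llm_call: Callable[[str], str],
                    max_attempts: int = 3) -> SubFunction:
    """Repair lower-granularity children first, then the parent that uses them."""
    # 1. Recurse into subfunctions, so the parent is debugged against fixed helpers.
    node.children = [debug_bottom_up(c, llm_call, max_attempts) for c in node.children]
    helpers = "\n\n".join(c.code for c in node.children)

    for _ in range(max_attempts):
        # 2. Ask the LLM to write assert-based test cases for this subfunction.
        tests = llm_call(f"Write assert-based test cases for:\n{node.code}")

        # 3. Ask the LLM to *simulate* executing the tests, tracing key variable
        #    states, and report either PASS or a description of the failure.
        trace = llm_call(
            "Simulate running these tests step by step, tracking variable states.\n"
            f"Helpers:\n{helpers}\n\nFunction:\n{node.code}\n\nTests:\n{tests}\n"
            "Reply 'PASS' if all tests pass, otherwise explain the failure."
        )
        if trace.strip().upper().startswith("PASS"):
            break

        # 4. Repair the subfunction using the simulated execution trace as feedback.
        node.code = llm_call(
            "Fix this function given the failing execution trace.\n"
            f"Function:\n{node.code}\n\nTrace:\n{trace}\nReturn only the fixed code."
        )
    return node
```

In the paper's pipeline the repaired subfunctions are then recomposed into the full solution and re-verified against the public tests; the sketch above only captures the per-node debug loop.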
Community
MGDebugger, a hierarchical bottom-up LLM code debugger 🔥 that can fix bugs from low-level syntax errors to high-level algorithmic flaws.
It achieves an ⭐️ 18.9% improvement in accuracy over seed generations on HumanEval and a ⭐️ 97.6% repair success rate on HumanEvalFix.
Code and demo available at https://github.com/YerbaPage/MGDebugger.
Brilliant!
Approximately, what is the overhead? I.e., the ratio between the subtotal tokens (finished code) and the total tokens (debugging steps + finished code)?
Great question!
Most debugging methods like Self-Debugging, LDB, Reflexion, etc., tend to have a high ratio of debugging tokens to finished code tokens (often > 5), as they perform extensive analyses to identify and resolve bugs. Despite this, they sometimes struggle to detect and fix subtle issues.
In our approach, MGDebugger might incur slightly higher token costs due to the hierarchical decomposition process, where we isolate and debug subfunctions separately. However, the method's effectiveness justifies this overhead since it addresses errors at multiple levels of granularity, allowing it to debug issues that other methods might overlook.
Hope that clarifies things!
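For anyone who wants to measure this on their own runs, a rough way to compute the overhead ratio discussed above is sketched below. It is not part of the paper's released code; `overhead_ratio` is an illustrative helper, and it assumes you have the full debugging transcript and the final code as strings and a `tiktoken`-style tokenizer.

```python
# Rough overhead estimate: total tokens spent (debugging transcript + finished code)
# divided by the tokens of the finished code alone.
import tiktoken


def overhead_ratio(debug_transcript: str, finished_code: str,
                   encoding_name: str = "cl100k_base") -> float:
    enc = tiktoken.get_encoding(encoding_name)
    debug_tokens = len(enc.encode(debug_transcript))
    code_tokens = len(enc.encode(finished_code))
    return (debug_tokens + code_tokens) / code_tokens

# Example: 5,000 tokens of debugging dialogue for a 400-token solution
# gives an overhead ratio of (5000 + 400) / 400 = 13.5.
```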
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement (2024)
- Debugging with Open-Source Large Language Models: An Evaluation (2024)
- Revisiting Evolutionary Program Repair via Code Language Model (2024)
- An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation (2024)
- Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants (2024)
We have released our demo on Hugging Face here: https://huggingface.co/spaces/learnmlf/MGDebugger ✨
Thanks to LDB for the inspiration!