Abstract
Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench (Mohammad et al., 2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J (Just et al., 2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.
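The certificate structure described above (explicit premises, traced execution paths, and a formal conclusion that cites its premises) can be sketched as a small data model. This is a hypothetical illustration, not the paper's actual prompt or implementation: the class name `SemiFormalTrace`, the `P1..Pn` premise-labeling convention, and the `EQUIVALENT`/`NOT-EQUIVALENT` verdict strings are all assumptions introduced here for concreteness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SemiFormalTrace:
    """One semi-formal reasoning certificate (hypothetical structure)."""
    premises: List[str]         # explicit facts the agent asserts about the code
    execution_paths: List[str]  # the paths the agent traced through the diff
    conclusion: str             # formal verdict that must cite premises

    def is_certificate(self) -> bool:
        # A trace only counts as a certificate when every component is
        # non-empty AND the conclusion cites at least one premise by label,
        # so the agent cannot emit an unsupported claim.
        cites_premise = any(
            f"P{i}" in self.conclusion
            for i in range(1, len(self.premises) + 1)
        )
        return bool(self.premises) and bool(self.execution_paths) and cites_premise


def build_prompt(patch_a: str, patch_b: str) -> str:
    # Illustrative prompt template for patch-equivalence verification;
    # the wording is assumed, not taken from the paper.
    return (
        "List premises P1..Pn about both patches.\n"
        "Trace every execution path affected by the diff.\n"
        "Conclude EQUIVALENT or NOT-EQUIVALENT, citing premises.\n"
        f"--- Patch A ---\n{patch_a}\n"
        f"--- Patch B ---\n{patch_b}\n"
    )


# A well-formed certificate passes the check; a trace that skips
# components or never cites a premise does not.
trace = SemiFormalTrace(
    premises=["P1: both patches guard against null input"],
    execution_paths=["null input -> early return in both versions"],
    conclusion="EQUIVALENT, by P1 and the traced null-input path",
)
```

The point of the validity check is the "certificate" property from the abstract: structure is enforced mechanically, so a conclusion with no supporting premises is rejected before it can be used, e.g., as an RL reward signal.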
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Agentic Rubrics as Contextual Verifiers for SWE Agents (2026)
- OmniCode: A Benchmark for Evaluating Software Engineering Agents (2026)
- FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight (2026)
- SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair (2026)
- TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code (2026)
- Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification (2026)
- Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models (2026)