Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
Abstract
Bayesian-Agent presents a framework that treats reusable skills and SOPs as hypotheses for model success, using Bayesian inference to guide agent behavior and improve task performance through posterior-guided harness optimization.
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
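To make the idea of a feature-conditioned categorical posterior mapped to actions more concrete, here is a minimal sketch. It is not the Bayesian-Agent implementation: the class `SkillPosterior`, the function `choose_action`, the two outcome categories, the symmetric Dirichlet prior, the thresholds, and the extra `keep` action are all assumptions made for illustration. Only `explore`, `retire`, and `patch` are covered; `split` and `compress` are omitted.

```python
from collections import defaultdict

# Hypothetical outcome categories; the paper's actual categories may differ.
OUTCOMES = ("success", "failure")


class SkillPosterior:
    """Dirichlet-categorical belief over outcomes, keyed by (skill, feature)."""

    def __init__(self, prior=1.0):
        # Symmetric Dirichlet pseudo-count per outcome (assumed, not from the paper).
        self.prior = prior
        self.counts = defaultdict(lambda: {o: 0 for o in OUTCOMES})

    def record(self, skill, feature, outcome):
        """Add one verified trajectory outcome as evidence."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[(skill, feature)][outcome] += 1

    def evidence(self, skill, feature):
        """Number of observed trajectories for this (skill, feature) pair."""
        return sum(self.counts[(skill, feature)].values())

    def mean(self, skill, feature):
        """Posterior mean probability of each outcome."""
        c = self.counts[(skill, feature)]
        total = sum(c.values()) + self.prior * len(OUTCOMES)
        return {o: (c[o] + self.prior) / total for o in OUTCOMES}


def choose_action(post, skill, feature,
                  min_evidence=5, retire_below=0.2, patch_below=0.7):
    """Map posterior state to an action using illustrative thresholds."""
    if post.evidence(skill, feature) < min_evidence:
        return "explore"  # too little evidence to judge the skill
    p_success = post.mean(skill, feature)["success"]
    if p_success < retire_below:
        return "retire"
    if p_success < patch_below:
        return "patch"
    return "keep"  # hypothetical no-op action, not named in the abstract
```

For example, a skill with 6 successes and 4 failures under one feature has a posterior mean success probability of 7/12 with the uniform prior, which these assumed thresholds map to `patch`, while an unobserved skill maps to `explore`.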
Community
Bayesian vs. Frequentist for Skill Evolving: Injecting a Cumulative, Auditable, and Transferable Belief State
The main advantage of the Bayesian approach to skill evolution is that it goes beyond the stateless "observe failure → patch" cycle. It instead carries a cumulative, auditable, and transferable belief state through the whole process: each skill's reliability is no longer a raw frequency (e.g., 1/1 = 100%) but a belief distribution with a prior, a posterior, and quantified uncertainty. This lets the agent stay robust when data is scarce, carry prior knowledge over when the environment changes, and keep every update traceable and explainable, whereas the frequentist approach stays at the level of count-from-zero, point-estimate, memoryless patching.
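The contrast between a raw frequency and a belief distribution can be illustrated with a standard Beta-Bernoulli update. This is a minimal sketch under assumptions, not code from the paper: the uniform Beta(1, 1) prior, the function names, and the `transfer_prior` discounting helper are all hypothetical.

```python
from math import sqrt


def frequency_estimate(successes, trials):
    """Point estimate from raw counts; carries no uncertainty."""
    return successes / trials


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return (a, b, mean, std) of the Beta posterior after the given counts."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, std


def transfer_prior(a, b, weight=0.5):
    """Hypothetical helper: shrink old pseudo-counts before reusing them
    as the prior in a changed environment (0 < weight <= 1)."""
    return weight * a, weight * b


# One success in one trial: the frequency says 100%, the posterior is cautious.
print(frequency_estimate(1, 1))
_, _, mean_1, std_1 = beta_posterior(1, 1)
print(round(mean_1, 3), round(std_1, 3))

# Ten successes in ten trials: same frequency, much tighter posterior.
a_10, b_10, mean_10, std_10 = beta_posterior(10, 10)
print(round(mean_10, 3), round(std_10, 3))

# Environment change: start from a discounted prior, then observe one failure.
new_a, new_b = transfer_prior(a_10, b_10)
print(beta_posterior(0, 1, prior_a=new_a, prior_b=new_b))
```

With the uniform prior, a single success gives a posterior mean of 2/3 rather than 100%, and the standard deviation shrinks as evidence accumulates. After the simulated environment change, one failure lowers the belief without resetting it to zero, because the discounted prior still carries the earlier evidence.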
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision (2026)
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills (2026)
- SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History (2026)
- Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents (2026)
- SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories (2026)
- Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents (2026)
- Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents (2026)