arxiv:2509.25849

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Published on Sep 30
· Submitted by Ziniu Li on Oct 2
Abstract

An adaptive exploration budget allocation method for reinforcement learning in Large Language Models improves training efficiency and performance on mathematical reasoning benchmarks.

AI-generated summary

Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem through the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach reallocates exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average gains of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
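
To see why both edge cases stall learning, here is a minimal sketch (not the paper's code) of GRPO's group-relative advantage, assuming binary per-rollout rewards and the standard group mean/std normalization: if every rollout in a task's group succeeds, or every one fails, the advantages are identically zero and that task contributes no policy gradient.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Uniform budget of 8 rollouts per task, binary (solved / not solved) rewards.
easy_task  = [1, 1, 1, 1, 1, 1, 1, 1]   # always solved -> all advantages 0
hard_task  = [0, 0, 0, 0, 0, 0, 0, 0]   # always fails  -> all advantages 0
mixed_task = [1, 0, 1, 0, 0, 1, 0, 0]   # mixed outcomes -> non-zero advantages

for name, r in [("easy", easy_task), ("hard", hard_task), ("mixed", mixed_task)]:
    print(name, grpo_advantages(r))
```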

Community

Paper submitter

Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation 🎒

Exploration in LLM training is important but costly. Insufficient exploration limits the model's performance ceiling.

Current uniform exploration is both ineffective and inefficient:

  • easy tasks → always solved → 0 gradient
  • hard tasks → always fail → 0 gradient

Our idea: treat exploration as a knapsack problem.
👉 Allocate rollouts where they matter most.
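
The paper derives its own optimal assignment rule from this knapsack view; the exact value and cost definitions are in the paper. Purely as an illustration, the sketch below uses assumed stand-ins: each task's per-rollout "value" is the marginal increase in the probability that its rollout group mixes successes and failures (i.e., yields a non-zero GRPO gradient, given an estimated success rate p_hat), its "cost" is per-rollout compute, and a greedy loop spends a fixed total budget on the best marginal value-per-cost step.

```python
def p_nonzero_gradient(p, n):
    """Probability that n rollouts mix successes and failures (non-zero GRPO gradient),
    assuming independent binary outcomes with success rate p."""
    return 1.0 - p**n - (1.0 - p)**n

def allocate_rollouts(tasks, total_budget, min_n=2, max_n=93):
    """Greedy knapsack-style allocation: repeatedly spend one rollout on the task
    with the largest marginal gain in non-zero-gradient probability per unit cost.
    Illustrative only; the value/cost proxies here are assumptions, not the paper's rule."""
    alloc = {t["id"]: min_n for t in tasks}                      # small floor for every task
    budget = total_budget - sum(min_n * t["cost"] for t in tasks)
    while budget > 0:
        candidates = []
        for t in tasks:
            n = alloc[t["id"]]
            if n >= max_n or t["cost"] > budget:
                continue
            gain = p_nonzero_gradient(t["p_hat"], n + 1) - p_nonzero_gradient(t["p_hat"], n)
            candidates.append((gain / t["cost"], t))
        if not candidates:
            break
        _, best = max(candidates, key=lambda c: c[0])
        alloc[best["id"]] += 1
        budget -= best["cost"]
    return alloc

tasks = [
    {"id": "easy",   "p_hat": 0.98, "cost": 1.0},   # nearly always solved
    {"id": "medium", "p_hat": 0.55, "cost": 1.0},   # learning frontier
    {"id": "hard",   "p_hat": 0.05, "cost": 2.0},   # longer responses, rarely solved
]
print(allocate_rollouts(tasks, total_budget=24.0))
```

Under this proxy, tasks the model already always solves attract almost no extra budget, while rarely-solved tasks can absorb much larger groups (capped here at 93, the peak figure reported in the paper) without increasing the total rollout budget.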

Results:

  • +20–40% non-zero gradients
  • Up to 93 rollouts for hard tasks (w/o extra compute)
  • +2–4 avg points, +9 peak gains on math benchmarks
  • ~2× cheaper than uniform allocation

Paper: https://www.arxiv.org/abs/2509.25849
