arxiv:2510.00915

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Published on Oct 1

Authors:

Xin-Qiang Cai ,

Abstract

RLVR systems use automated verifiers to train policies with binary rewards, addressing verifier unreliability through backward and forward correction algorithms to improve policy training and convergence.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary {0,1} during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction 12{36} as wrong when compared against the canonical 1{3} due to brittle parsing/equivalence rules (FN), while a large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a backward correction that de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient. The second is a forward correction that reweights score-function terms so that the expected update direction aligns with the clean gradient; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, obtaining outperformance compared with other state-of-the-art contenders.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.00915 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.00915 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.00915 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.