Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 19, 2024 • 136
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL Paper • 2403.03950 • Published Mar 6, 2024 • 15
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes Paper • 2306.13649 • Published Jun 23, 2023 • 18