Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 19 • 135
Power-LM Collection Dense & MoE LLMs trained with power learning rate scheduler. • 4 items • Updated Oct 17 • 15
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Paper • 2408.07199 • Published Aug 13 • 20
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery Paper • 2408.06292 • Published Aug 12 • 115
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes Paper • 2306.13649 • Published Jun 23, 2023 • 16 • 2
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning Paper • 2407.18248 • Published Jul 25 • 31 • 4
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF Paper • 2405.21046 • Published May 31 • 3