MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI • Paper • 2605.08678 • Published 23 days ago • 9
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning • Paper • 2605.00347 • Published May 1 • 16
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks • Paper • 2604.11753 • Published Apr 13 • 15
The PokeAgent Challenge: Competitive and Long-Context Learning at Scale • Paper • 2603.15563 • Published Mar 16 • 11
Self-rewarding correction for mathematical reasoning • Paper • 2502.19613 • Published Feb 26, 2025 • 82