SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? Paper • 2507.12415 • Published Jul 16 • 41
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once Paper • 2507.10541 • Published Jul 14 • 29
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning Paper • 2506.16012 • Published Jun 19 • 22