Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning Paper • 2504.21561 • Published Apr 30 • 1
Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL Paper • 2505.15436 • Published May 21 • 2
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation Paper • 2509.23866 • Published 12 days ago • 10
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search Paper • 2509.07969 • Published about 1 month ago • 59
TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models Paper • 2506.03099 • Published Jun 3 • 19
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials Paper • 2504.12679 • Published Apr 17 • 1
TongUI Collection Open-source release of our work TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials; https://github.com/TongUI-agent/TongUI-agent • 10 items • Updated Jul 1 • 3