GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding Paper • 2511.00810 • Published 9 days ago • 3
Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement Paper • 2511.01706 • Published 8 days ago • 2
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers Paper • 2511.01617 • Published 8 days ago • 2
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings Paper • 2511.00405 • Published 10 days ago • 5
MotionStream: Real-Time Video Generation with Interactive Motion Controls Paper • 2511.01266 • Published 8 days ago • 24
EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities Paper • 2510.27545 • Published 11 days ago • 47
How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment Paper • 2511.01775 • Published 8 days ago • 6
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models Paper • 2511.01618 • Published 8 days ago • 9
$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles Paper • 2511.01340 • Published 8 days ago • 12
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published 12 days ago • 11
NaviTrace: Evaluating Embodied Navigation of Vision-Language Models Paper • 2510.26909 • Published 11 days ago • 13
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning Paper • 2511.01833 • Published 7 days ago • 15
MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models Paper • 2510.24794 • Published 15 days ago • 31
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum Paper • 2510.27571 • Published 11 days ago • 16
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use Paper • 2510.27363 • Published 11 days ago • 22