Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward Paper • 2511.20561 • Published 11 days ago • 31
AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies Paper • 2508.08113 • Published Aug 11 • 11
From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens Paper • 2510.02292 • Published Oct 2 • 1
Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry Paper • 2510.25595 • Published Oct 29
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3 • 31
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy Paper • 2510.13778 • Published Oct 15 • 16
SemiReward: A General Reward Model for Semi-supervised Learning Paper • 2310.03013 • Published Oct 4, 2023 • 2
Switch EMA: A Free Lunch for Better Flatness and Sharpness Paper • 2402.09240 • Published Feb 14, 2024 • 5
OpenMixup: Open Mixup Toolbox and Benchmark for Visual Representation Learning Paper • 2209.04851 • Published Sep 11, 2022 • 3
SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models Paper • 2510.12784 • Published Oct 14 • 19
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation Paper • 2506.21876 • Published Jun 27 • 28
Can Vision Language Models Infer Human Gaze Direction? A Controlled Study Paper • 2506.05412 • Published Jun 4 • 4
4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time Paper • 2506.18890 • Published Jun 23 • 6
MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection Paper • 2404.04910 • Published Apr 7, 2024
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models Paper • 2503.04240 • Published Mar 6
Science-T2I: Addressing Scientific Illusions in Image Synthesis Paper • 2504.13129 • Published Apr 17 • 3
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark Paper • 2504.14693 • Published Apr 20
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments Paper • 2503.08604 • Published Mar 11
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published May 29 • 14
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13 • 23