Video understanding
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published
• 32
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
• 2407.19985
• Published
• 37
TPDiff: Temporal Pyramid Video Diffusion Model
Paper
• 2503.09566
• Published
• 45
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Paper
• 2506.07464
• Published
• 14
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published
• 100
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Paper
• 2510.05034
• Published
• 51
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Paper
• 2510.20579
• Published
• 56
Video Reasoning without Training
Paper
• 2510.17045
• Published
• 8
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
Paper
• 2510.23473
• Published
• 85
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Paper
• 2510.26802
• Published
• 34
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 77
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published
• 55
In-Video Instructions: Visual Signals as Generative Control
Paper
• 2511.19401
• Published
• 32
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Paper
• 2512.01342
• Published
• 18
ViDiC: Video Difference Captioning
Paper
• 2512.03405
• Published
• 28
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Paper
• 2512.04678
• Published
• 42
Evaluating Gemini Robotics Policies in a Veo World Simulator
Paper
• 2512.10675
• Published
• 20
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Paper
• 2512.13874
• Published
• 17
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Paper
• 2512.15702
• Published
• 16
Kling-Omni Technical Report
Paper
• 2512.16776
• Published
• 170
SemanticGen: Video Generation in Semantic Space
Paper
• 2512.20619
• Published
• 93
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Paper
• 2512.20618
• Published
• 55
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Paper
• 2512.21004
• Published
• 13
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Paper
• 2601.10553
• Published
• 12
Rethinking Video Generation Model for the Embodied World
Paper
• 2601.15282
• Published
• 43
Self-Refining Video Sampling
Paper
• 2601.18577
• Published
• 25
RISE-Video: Can Video Generators Decode Implicit World Rules?
Paper
• 2602.05986
• Published
• 26
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Paper
• 2602.16705
• Published
• 26
RynnBrain: Open Embodied Foundation Models
Paper
• 2602.14979
• Published
• 42
A Very Big Video Reasoning Suite
Paper
• 2602.20159
• Published
• 494