Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation • Paper • 2410.05363 • Published 30 days ago • 44
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models • Paper • 2410.03290 • Published Oct 4 • 6
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling • Paper • 2409.19291 • Published Sep 28 • 18