Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Paper • 2502.02339 • Published 13 days ago • 21
Ovis2 Collection Our latest advancement in multi-modal large language models (MLLMs) • 8 items • Updated about 13 hours ago • 37
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published 27 days ago • 93
VideoLLaMA3 Collection Frontier Multimodal Foundation Models for Video Understanding • 14 items • Updated 10 days ago • 11
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published 26 days ago • 83
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published 27 days ago • 82
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Paper • 2501.03262 • Published Jan 4 • 90
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1 • 99
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 41
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated 7 days ago • 59
Inf-CL Collection The corresponding demos/checkpoints/papers/datasets of Inf-CL. • 2 items • Updated 24 days ago • 3
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 20
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 89
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper • 2410.12490 • Published Oct 16, 2024 • 8
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16, 2024 • 31