VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22, 2025 • 90
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 129
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model Paper • 2303.09867 • Published Mar 17, 2023
Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation Paper • 2303.13399 • Published Mar 23, 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning Paper • 2303.14369 • Published Mar 25, 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment Paper • 2305.12218 • Published May 20, 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Paper • 2311.08046 • Published Nov 14, 2023 • 2
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Paper • 2311.10122 • Published Nov 16, 2023 • 27
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting Paper • 2312.13271 • Published Dec 20, 2023 • 6
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29, 2024 • 53
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models Paper • 2402.05935 • Published Feb 8, 2024 • 17
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference Paper • 2406.18139 • Published Jun 26, 2024 • 2
FreestyleRet: Retrieving Images from Style-Diversified Queries Paper • 2312.02428 • Published Dec 5, 2023
Parallel Vertex Diffusion for Unified Visual Grounding Paper • 2303.07216 • Published Mar 13, 2023
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts Paper • 2410.07348 • Published Oct 9, 2024 • 1
MoH: Multi-Head Attention as Mixture-of-Head Attention Paper • 2410.11842 • Published Oct 15, 2024 • 22
WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation Paper • 2306.10750 • Published Jun 19, 2023