-
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 67 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 125 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85
Collections
Discover the best community collections!
Collections including paper arxiv:2409.18125
-
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper • 2409.02097 • Published • 31 -
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper • 2409.11406 • Published • 25 -
Diffusion Models Are Real-Time Game Engines
Paper • 2408.14837 • Published • 121 -
Segment Anything with Multiple Modalities
Paper • 2408.09085 • Published • 21
-
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Paper • 2409.18124 • Published • 31 -
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Paper • 2409.18125 • Published • 33 -
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
Paper • 2410.11795 • Published • 16 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85
-
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 54 -
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
Paper • 2409.07129 • Published • 6 -
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Paper • 2409.18125 • Published • 33
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 24 -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 29 -
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 30 -
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 29
-
TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion
Paper • 2401.09416 • Published • 9 -
SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild
Paper • 2401.10171 • Published • 12 -
DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model
Paper • 2311.09217 • Published • 21 -
GALA: Generating Animatable Layered Assets from a Single Scan
Paper • 2401.12979 • Published • 6
-
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Paper • 2401.04092 • Published • 20 -
AToM: Amortized Text-to-Mesh using 2D Diffusion
Paper • 2402.00867 • Published • 10 -
Advances in 3D Generation: A Survey
Paper • 2401.17807 • Published • 17 -
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
Paper • 2401.09340 • Published • 18
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 14 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 45 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 40