VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation Paper • 2412.21059 • Published 10 days ago • 17
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published 25 days ago • 49
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Paper • 2412.18525 • Published 16 days ago • 65
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System Paper • 2412.20005 • Published 13 days ago • 17
Slow Perception: Let's Perceive Geometric Figures Step-by-step Paper • 2412.20631 • Published 11 days ago • 13
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper • 2412.18609 • Published 16 days ago • 15
Progressive Multimodal Reasoning via Active Retrieval Paper • 2412.14835 • Published 21 days ago • 71
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 15
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • 2409.20566 • Published Sep 30, 2024 • 55
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 16, 2024 • 127
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study Paper • 2302.06555 • Published Feb 13, 2023 • 9
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3, 2024 • 65
Improving fine-grained understanding in image-text pre-training Paper • 2401.09865 • Published Jan 18, 2024 • 16
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model Paper • 2401.09417 • Published Jan 17, 2024 • 59