Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published 2 days ago • 28
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published 8 days ago • 44
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs Paper • 2411.14199 • Published 2 days ago • 19
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper • 2411.14432 • Published 2 days ago • 13
Vision/multimodal Models Collection • Collection of the most popular vision models, including Llama 3.2, LLaVA, Qwen2 VL, Pixtral, PaliGemma and more! • 22 items • Updated 2 days ago • 4
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published 3 days ago • 15
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration Paper • 2411.10958 • Published 6 days ago • 41
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Paper • 2411.10161 • Published 8 days ago • 6
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 4 days ago • 41
AnimateAnything: Consistent and Controllable Animation for Video Generation Paper • 2411.10836 • Published 7 days ago • 18
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Paper • 2411.10640 • Published 8 days ago • 39
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement Paper • 2411.06558 • Published 13 days ago • 29
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use Paper • 2411.10323 • Published 8 days ago • 26
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published 8 days ago • 93
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation Paper • 2411.08033 • Published 11 days ago • 21
Thinking LLMs: General Instruction Following with Thought Generation Paper • 2410.10630 • Published Oct 14 • 16
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Paper • 2411.09595 • Published 9 days ago • 66
MagicQuill: An Intelligent Interactive Image Editing System Paper • 2411.09703 • Published 9 days ago • 52
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination Paper • 2411.03823 • Published 17 days ago • 43