Pretergeek's Collections
(Papers) Multimodal

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 47

Visual Instruction Tuning
Paper • 2304.08485 • Published • 13

Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 37

Making Large Multimodal Models Understand Arbitrary Visual Prompts
Paper • 2312.00784 • Published • 2

LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 59

Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 49

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 40

Paper • 2410.21276 • Published • 79

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Paper • 2410.11190 • Published • 20

Paper • 2410.07073 • Published • 60

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Paper • 2409.18125 • Published • 33

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 54

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper • 2408.15881 • Published • 20

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40

An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85

What matters when building vision-language models?
Paper • 2405.02246 • Published • 99

TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19

LEGO: Language Enhanced Multi-modal Grounding Model
Paper • 2401.06071 • Published • 10

LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Paper • 2401.02330 • Published • 14