Multimodal Language Model
What matters besides the data recipe when training a multimodal language model?
- Paper • 2408.03326 • Published • 60
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 40
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 68
openbmb/MiniCPM-V-2_6
Image-Text-to-Text • Updated • 65.3k • 911
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 98
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 126
OpenGVLab/InternViT-6B-448px-V1-2
Image Feature Extraction • Updated • 219 • 27
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 56
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 17
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 9
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 58
Note
1. The intra-image bidirectional attention is important; replacing it with causal attention hurts text-to-image generation.
2. There is a clear advantage to using U-Net up and down blocks instead of a simple linear layer for modality mapping.
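The first point above can be illustrated with a small sketch of the attention mask: causal everywhere, plus full attention among the patches of the same image. The function name `transfusion_style_mask` and the `image_spans` argument are illustrative assumptions, not names from the paper's code.

```python
def transfusion_style_mask(seq_len, image_spans):
    """Return mask[i][j] == True when position i may attend to position j.

    image_spans: list of (start, end) half-open index ranges, one per image
    (an assumed input format for this sketch).
    """
    # Start from a standard causal mask: each position sees itself and the past.
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Inside each image span, let every patch attend to every other patch.
    for start, end in image_spans:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


# Six positions; positions 2, 3 and 4 are patches of a single image.
mask = transfusion_style_mask(6, [(2, 5)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Text tokens stay strictly causal, while the image block in the middle of the mask becomes a filled square.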
LISA: Reasoning Segmentation via Large Language Model
Paper • 2308.00692 • Published • 1
Note
1. Extract the feature of the [SEG] token from the last hidden layer of the LLM and project it to the SAM decoder.
2. Joint training with pixel-level understanding data often leads to decreased image-level capability.
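A minimal sketch of the first point, assuming the last-layer hidden states are available as a list of vectors. A single linear layer stands in for the projection here; `seg_prompt_embedding` and all argument names are hypothetical.

```python
def seg_prompt_embedding(token_ids, last_hidden, seg_token_id, proj_weight, proj_bias):
    """Pick the last-layer hidden state at the [SEG] position and project it.

    token_ids:   token ids of the generated sequence
    last_hidden: one hidden vector per token (last LLM layer)
    proj_weight: rows of a linear projection into the mask decoder's prompt space
    """
    positions = [i for i, tok in enumerate(token_ids) if tok == seg_token_id]
    if not positions:
        return None  # the model did not emit [SEG], so there is nothing to segment
    hidden = last_hidden[positions[-1]]
    # Linear projection: out[k] = sum_d W[k][d] * hidden[d] + b[k]
    return [
        sum(w * x for w, x in zip(row, hidden)) + b
        for row, b in zip(proj_weight, proj_bias)
    ]


token_ids = [5, 9, 7]                   # id 9 plays the role of [SEG]
last_hidden = [[1, 0], [2, 3], [0, 1]]  # toy 2-dimensional hidden states
weight = [[1, 0], [0, 1], [1, 1]]       # projects 2 -> 3 dimensions
bias = [0, 0, 1]
prompt = seg_prompt_embedding(token_ids, last_hidden, 9, weight, bias)
print(prompt)
```

The returned vector is what would be handed to the mask decoder as its prompt embedding.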
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 53
Note
1. Image encoder: a ConvNeXt-L-based CLIP model, chosen to reach high resolution.
2. Directly combining a frozen perception module with an LLM doesn't perform well.
3. Use a simple MLP to map the LLM output's hidden states of the [SEG] token to the visual space.
4. Propose a good region encoder design adapted from a pre-trained image encoder.
5. Respond in the form "Expression [SEG]." Since the "Expression" part is flexible and variable, the LLM is less likely to overfit to a fixed response.
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 40
Note
1. A Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible.
2. A Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues.
3. Concatenate them and, bang, you have good video features even without training.
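The two pathways can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: each visual token is a single float rather than a feature vector, and the sampling interval and strides in the example are made-up values, not the paper's settings.

```python
def avg_pool(grid, stride):
    """Average-pool a 2D grid (list of rows) with a square window of size `stride`."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for r in range(0, h - stride + 1, stride):
        row = []
        for c in range(0, w - stride + 1, stride):
            window = [grid[r + dr][c + dc] for dr in range(stride) for dc in range(stride)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def slowfast_tokens(frames, slow_every, slow_stride, fast_stride):
    """Concatenate Slow tokens (few frames, light pooling) with Fast tokens
    (all frames, heavy pooling)."""
    slow = []
    for frame in frames[::slow_every]:
        slow.extend(v for row in avg_pool(frame, slow_stride) for v in row)
    fast = []
    for frame in frames:
        fast.extend(v for row in avg_pool(frame, fast_stride) for v in row)
    return slow + fast


# 8 frames, each a 12x12 token grid whose values equal the frame index.
frames = [[[float(t)] * 12 for _ in range(12)] for t in range(8)]
tokens = slowfast_tokens(frames, slow_every=4, slow_stride=2, fast_stride=6)
print(len(tokens))  # 2 slow frames * 36 tokens + 8 fast frames * 4 tokens
```

No parameters are learned anywhere in this pipeline, which is the sense in which the approach is training-free.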
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Paper • 2311.05698 • Published • 9
Note
1. Scale to 512 input video frames with the Token Turing Machine Combiner.
2. The 'Process' function is implemented with a standard Transformer with layers of MHA and MLPs. The functions 'Read', 'Write', and 'Output' are implemented with attention pooling.
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 124
Note
1. Training methods
1.1 Train in stages: use progressively higher-quality data, gradually increase the maximum image resolution, and unfreeze more parts of the model.
2. Dataset
2.1 With image deduplication, it is possible to train on just half of the LAION dataset with only a minimal reduction in performance compared to using the full dataset.
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 85
Note
1. Unfreezing the CLIP encoder significantly improves performance when interpolating to a higher MLLM input resolution that differs from the pre-training resolution.
2. Introduce a Pre-Alignment training stage:
2.1 Train each pre-trained vision expert with its own projector on SFT data, while keeping the language model frozen.
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 73
allenai/Molmo-7B-D-0924
Image-Text-to-Text • Updated • 585k • 496
meta-llama/Llama-3.2-11B-Vision-Instruct
Image-Text-to-Text • Updated • 2.39M • 1.24k
Video Instruction Tuning With Synthetic Data
Paper • 2410.02713 • Published • 39
Note
1. Arrange slow and fast frames in an interleaving pattern, with p × p pooling for slow frames and 2p × 2p pooling for fast frames.
2. Use a tagging model to categorize the video content: InsTag (https://arxiv.org/pdf/2308.07074).
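To see what the interleaving in point 1 means for the token budget, here is a small sketch. It assumes that every `slow_interval`-th frame is a slow frame and that frames are square token grids; both are assumptions made for illustration, and `interleaved_schedule` is a hypothetical helper.

```python
def interleaved_schedule(num_frames, slow_interval, grid_side, p):
    """List (kind, pooling stride, tokens kept) for each frame, in temporal order.

    Slow frames use p x p pooling; fast frames use 2p x 2p pooling.
    """
    schedule = []
    for t in range(num_frames):
        is_slow = t % slow_interval == 0
        stride = p if is_slow else 2 * p
        tokens = (grid_side // stride) ** 2
        schedule.append(("slow" if is_slow else "fast", stride, tokens))
    return schedule


schedule = interleaved_schedule(num_frames=6, slow_interval=3, grid_side=24, p=2)
for kind, stride, tokens in schedule:
    print(f"{kind:4s} stride={stride} tokens={tokens}")
total_tokens = sum(tokens for _, _, tokens in schedule)
print("total:", total_tokens)
```

Because the stride doubles for fast frames, each fast frame keeps a quarter of the tokens of a slow frame.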
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Paper • 2410.16267 • Published • 17
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 76
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 25
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 43
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 124
Note
The detection task uses the prefix "detect all classes" and provides box coordinates and class names for all annotated objects, in random order, in the target sequence (suffix). To reach the maximum sequence length, noise boxes with random coordinates and a token as the class name are added. No loss is applied to the noise box coordinates during training, but the class tokens are treated with loss as usual.
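The padding scheme in this note can be sketched as follows. The `<noise>` string is a placeholder for the class token that the note leaves unnamed, each class name is treated as a single token for simplicity, and `build_detection_target` is a hypothetical helper rather than the authors' code.

```python
import random


def build_detection_target(objects, max_boxes, noise_token="<noise>", seed=0):
    """Build the suffix tokens and a per-token loss mask (1 = contributes to loss).

    objects: list of ((y0, x0, y1, x1), class_name) with coordinates in 0..1023.
    """
    rng = random.Random(seed)
    objects = list(objects)
    rng.shuffle(objects)  # annotated objects appear in random order
    tokens, loss_mask = [], []
    for box, name in objects[:max_boxes]:
        for coord in box:
            tokens.append(f"<loc{coord:04d}>")
            loss_mask.append(1)
        tokens.append(name)
        loss_mask.append(1)
    # Pad with noise boxes up to the fixed number of boxes.
    for _ in range(max_boxes - len(objects)):
        for _ in range(4):
            tokens.append(f"<loc{rng.randrange(1024):04d}>")
            loss_mask.append(0)  # random coordinates carry no loss
        tokens.append(noise_token)
        loss_mask.append(1)      # the class token is still trained on
    return tokens, loss_mask


objects = [((10, 20, 30, 40), "cat"), ((1, 2, 3, 4), "dog")]
tokens, loss_mask = build_detection_target(objects, max_boxes=4)
print(len(tokens), sum(loss_mask))
```

With two real objects and room for four boxes, the two padded boxes contribute only their class tokens to the loss.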
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 139
Note
1. While the Qwen2-1.5B and Qwen1.5-4B variants had similar performance, the 4B model was still more correlated with larger models than the 1.5B model.
2. 500K samples are sufficient for moderately sized models (2–4B) to reliably transfer design insights to larger models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Paper • 2412.05271 • Published • 128
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper • 2501.04001 • Published • 40