Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Abstract
UniFilter, a Unified Multimodal Data Quality Classifier, enhances Multimodal Large Language Models (MLLMs) by filtering for high-quality image-text and interleaved data, leading to improved zero-shot reasoning and in-context learning.
Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for both caption and interleaved document data remains under-explored. We propose training an efficient MLLM as a Unified Multimodal Data Quality Classifier to filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
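The filtering step the abstract describes, scoring each candidate sample with the trained classifier and keeping only high-scoring ones, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the paper's released API: the `Sample` type, `score_fn` callback, threshold value, and the four-level label set are all illustrative.

```python
# Minimal sketch of UniFilter-style quality filtering. Assumes a trained
# classifier exposed as a function mapping an (image, text) sample to a
# scalar quality score; all names and the threshold are hypothetical.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    image_path: str  # raw image (readily available, per the semi-synthetic setup)
    text: str        # caption or interleaved document text

# Training labels pair each sample with one of four quality levels,
# generated by producing text of varying quality for a fixed raw image.
QUALITY_LEVELS = (0, 1, 2, 3)

def filter_high_quality(
    samples: Iterable[Sample],
    score_fn: Callable[[Sample], float],  # e.g., a UniFilter checkpoint wrapper
    threshold: float = 0.75,              # hypothetical cutoff, tuned per dataset
) -> List[Sample]:
    """Keep only samples whose predicted quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]
```

In practice the cutoff (or a top-k budget) would be chosen per corpus, since caption datasets like DataComp and interleaved corpora like OBELICS have different score distributions.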
Community
A Unified Multimodal Data Quality Classifier to curate high-quality image-text caption and interleaved data
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer (2025)
- MobileCLIP2: Improving Multi-Modal Reinforced Training (2025)
- Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (2025)
- LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training (2025)
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)
- WAVE: Learning Unified&Versatile Audio-Visual Embeddings with Multimodal LLM (2025)
- Efficient Multimodal Dataset Distillation via Generative Models (2025)