arxiv:2601.10305

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Published on Jan 15 · Submitted by Kaicheng Yang on Jan 16

Authors:

Hengyu Shen, Tiancheng Gu, Zelong Sun, Weidong Cai, Ziyong Feng, Kaicheng Yang

Abstract

AI-generated summary: A large-scale Chinese image-text dataset called DanQing is introduced to advance vision-language pre-training, demonstrating superior performance in various downstream tasks through continual pre-training of the SigLIP2 model.

Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pre-training has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. The result is DanQing, which contains 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
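For readers unfamiliar with the contrastive objective mentioned in the abstract, the following is a minimal sketch of the pairwise sigmoid loss associated with SigLIP-style models. It is not the authors' training code: the function names, the toy embeddings, and the default `temperature` and `bias` values are illustrative assumptions, and real training operates on batched encoder outputs with learnable temperature and bias.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    Pair (i, i) is treated as a positive (label +1); every pair (i, j)
    with i != j is treated as a negative (label -1). The temperature and
    bias defaults are assumed starting values, not tuned settings.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * similarity + bias))
    return -total / n


# Toy example: two images and two captions with hand-written 2-D embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]   # caption i describes image i
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped

matched_loss = sigmoid_contrastive_loss(image_embs, matched_texts)
shuffled_loss = sigmoid_contrastive_loss(image_embs, shuffled_texts)
print(matched_loss < shuffled_loss)
```

Because each image-text pair is scored independently, this objective needs no softmax normalization across the batch; correctly paired captions yield a lower loss than swapped ones, which is the signal continual pre-training on an image-text dataset would exploit.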

Community

thecupcutapk

about 9 hours ago

This is a strong and timely contribution to Chinese vision-language pre-training research. DanQing directly addresses the long-standing bottleneck of high-quality Chinese image–text data, and the scale combined with rigorous filtering clearly sets it apart from existing resources. Building the dataset primarily from 2024–2025 web data is especially valuable, as it allows models to better reflect evolving language usage and real-world semantics. https://thecupcut.com/