arxiv:2002.06353

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Published on Feb 15, 2020
Authors: Lei Ji, et al.

Abstract

With the recent success of pre-training techniques for NLP and image-language tasks, video-language pre-training works have gradually been developed to improve video-text related downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on HowTo100M, a sizeable instructional video dataset. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
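
To make the four-component layout concrete, below is a minimal sketch of an encoder-encoder-cross-encoder-decoder stack using standard PyTorch Transformer modules. It is illustrative only: the module names, layer counts, hidden sizes, and feature dimensions are assumptions for the sketch, not the authors' configuration, and the five pre-training objectives are not implemented here.

```python
# Minimal sketch of a UniVL-style four-component model (hypothetical names;
# layer counts and dimensions are illustrative assumptions, not the paper's setup).
import torch
import torch.nn as nn


def encoder(layers: int, dim: int = 768, heads: int = 12) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class UniVLSketch(nn.Module):
    def __init__(self, vocab_size: int = 30522, video_feat_dim: int = 1024, dim: int = 768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)      # token ids -> vectors
        self.video_proj = nn.Linear(video_feat_dim, dim)     # clip features -> model dim
        self.text_encoder = encoder(layers=12, dim=dim)      # single-modal text encoder
        self.video_encoder = encoder(layers=6, dim=dim)      # single-modal video encoder
        self.cross_encoder = encoder(layers=2, dim=dim)      # fuses the two modalities
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)  # generation component
        self.lm_head = nn.Linear(dim, vocab_size)            # language reconstruction / generation

    def forward(self, text_ids, video_feats, dec_ids):
        t = self.text_encoder(self.text_embed(text_ids))      # text-only representation
        v = self.video_encoder(self.video_proj(video_feats))  # video-only representation
        fused = self.cross_encoder(torch.cat([t, v], dim=1))  # joint video-text representation
        dec = self.decoder(self.text_embed(dec_ids), fused)   # decoding conditioned on fused features
        return self.lm_head(dec)                              # token logits


# Usage: a batch of 2 captions (16 tokens) and 2 videos (8 clip-level features each).
model = UniVLSketch()
logits = model(torch.randint(0, 30522, (2, 16)),
               torch.randn(2, 8, 1024),
               torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 30522])
```

The separation into single-modal encoders plus a cross encoder is what lets understanding tasks use the fused representation while the decoder supports generation, which is the discrepancy the paper aims to remove.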
