Abstract

This project focuses on multilingual image captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept showing that our CLIP Vision + mBART-50 model, which starts from a pre-trained image encoder and a multilingual text checkpoint, can be trained to perform reasonably well.

Due to the lack of good-quality multilingual data, we take subsets of the Conceptual 12M dataset, keep the original English captions (no translation needed), and translate them into French, German and Spanish using the mBART large one-to-many model. With better translated captions and hyperparameter tuning, we expect to see higher performance.
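The translation step described above could look roughly like the following. This is a minimal sketch, not the project's actual preprocessing script: the helper names `build_multilingual_captions` and `load_mbart_translator` are hypothetical, the checkpoint name `facebook/mbart-large-50-one-to-many-mmt` is an assumption about which "mBART large one-to-many" model is meant, and the PyTorch model class is used only for illustration.

```python
from typing import Callable, Dict, List

# mBART-50 language codes for the four languages named in the abstract.
LANG_CODES: Dict[str, str] = {
    "en": "en_XX",
    "fr": "fr_XX",
    "de": "de_DE",
    "es": "es_XX",
}

# A translator takes a batch of English captions and a target language code.
Translator = Callable[[List[str], str], List[str]]


def build_multilingual_captions(
    captions: List[str], translate: Translator
) -> List[Dict[str, str]]:
    """Return one record per caption, keyed by language (hypothetical helper)."""
    # English captions are kept as-is; no translation is needed.
    rows: List[Dict[str, str]] = [{"en": caption} for caption in captions]
    for lang, code in LANG_CODES.items():
        if lang == "en":
            continue
        for row, text in zip(rows, translate(captions, code)):
            row[lang] = text
    return rows


def load_mbart_translator(
    model_name: str = "facebook/mbart-large-50-one-to-many-mmt",  # assumed checkpoint
) -> Translator:
    """Build a translator backed by mBART-50 (hypothetical helper)."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    def translate(texts: List[str], target_code: str) -> List[str]:
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        # Forcing the first generated token selects the target language.
        generated = model.generate(
            **batch,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_code),
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate
```

Calling `build_multilingual_captions(captions, load_mbart_translator())` would download the checkpoint and return one dictionary per caption with `en`, `fr`, `de` and `es` keys. Passing the translator in as a function keeps the record-building logic independent of the model, so a better translation system could be swapped in later.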