Engage in multi-modal conversations with images and videos
Generate text by combining an image and a question