VLM (Cartoon Captioning)
VLM (Cartoon Captioning) is a vision-language model trained with the AnyModal framework to generate captions for cartoon images 🎨. It combines a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model to produce descriptive, context-aware captions for cartoons from the New Yorker Cartoon Caption Contest dataset. The weights of the projector network that maps ViT image embeddings into the Llama 3.2-1B embedding space are available in this repository.
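For intuition, here is a minimal, illustrative sketch of the projection step, assuming ViT-Base's 768-dimensional patch embeddings and Llama 3.2-1B's 2048-dimensional hidden size. The layer layout and random weights below are placeholders, not the trained projector (which ships as the input_tokenizer.pt file used in the inference example further down):
import torch
import torch.nn as nn

# Illustrative stand-in projector with one hidden layer (mirroring
# num_hidden=1 in the inference example); NOT the trained weights.
vit_hidden, llama_hidden, num_patches = 768, 2048, 197  # ViT-Base/16, Llama 3.2-1B

projector = nn.Sequential(
    nn.Linear(vit_hidden, llama_hidden),
    nn.GELU(),
    nn.Linear(llama_hidden, llama_hidden),
)

patch_embeddings = torch.randn(1, num_patches, vit_hidden)  # dummy ViT encoder output
soft_tokens = projector(patch_embeddings)
print(soft_tokens.shape)  # torch.Size([1, 197, 2048])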
Trained On
This model was trained on the New Yorker Cartoon Caption Contest Dataset:
Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest
Jack Hessel, Ana Marasović, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi
How to Use
Installation
Install the required dependencies:
pip install torch transformers torchvision huggingface_hub tqdm matplotlib Pillow
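Note that meta-llama/Llama-3.2-1B is a gated model on the Hugging Face Hub, so you need an access token from an account that has accepted its license; the inference example below passes it as access_token. Alternatively, authenticate once from the command line:
huggingface-cli login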
Inference
The model can be used to generate captions for cartoon images. Below is an example workflow:
import os

from PIL import Image
from huggingface_hub import hf_hub_download

# AnyModal modules from the project repository
import llm
import anymodal
import vision
# Load language model and tokenizer
llm_tokenizer, llm_model = llm.get_llm(
    "meta-llama/Llama-3.2-1B",
    access_token="GET_YOUR_OWN_TOKEN_FROM_HUGGINGFACE",
    use_peft=False,
)
llm_hidden_size = llm.get_hidden_size(llm_tokenizer, llm_model)
# Load vision model components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder("google/vit-base-patch16-224", use_peft=False)
# Initialize vision encoder and projector (vision tokenizer)
vision_encoder = vision.VisionEncoder(vision_model)
vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)
# Initialize MultiModalModel
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token="<|imstart|>",
    input_end_token="<|imend|>",
    prompt_text="The description of the given cartoon is: ",
)
# Download the pre-trained projector weights from this repository
os.makedirs("image_captioning_model", exist_ok=True)
hf_hub_download(
    "AnyModal/VLM_Cartoon_Caption",
    filename="input_tokenizer.pt",
    local_dir="image_captioning_model",
)
multimodal_model._load_model("image_captioning_model")
# Generate caption for a cartoon image
image_path = "cartoon_example.jpg" # Path to your cartoon image
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {key: val.squeeze(0) for key, val in processed_image.items()} # Remove batch dimension
# Generate caption
generated_caption = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated Caption:", generated_caption)
Project and Training Scripts
This model is part of the AnyModal Image Captioning Project.
- Training Script: train.py
- Inference Script: inference.py
Explore the full project repository for more details and customization options.
Project Details
- Vision Encoder: Pre-trained Vision Transformer (ViT) model (google/vit-base-patch16-224) for image feature extraction.
- Projector Network: Maps image features into a token space compatible with the language model.
- Language Model: Pre-trained causal language model (Llama 3.2-1B) for natural language generation.
The model was fine-tuned on the New Yorker cartoon dataset to generate humorous and contextually relevant captions, leveraging AnyModal's flexible multimodal framework.
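For a concrete picture of how these three components interact, below is a simplified, hypothetical sketch (build_inputs_embeds is an illustrative helper, not part of the AnyModal API): the projected image tokens are concatenated with the embedded text prompt, and the language model generates the caption conditioned on that sequence. In the actual model the image tokens are additionally wrapped in <|imstart|> and <|imend|> markers.
import torch

def build_inputs_embeds(vision_features, projector, llm_model, llm_tokenizer, prompt_text):
    # vision_features: (batch, num_patches, vision_hidden_size) from the ViT encoder
    image_tokens = projector(vision_features)  # (batch, num_patches, llm_hidden_size)
    prompt_ids = llm_tokenizer(prompt_text, return_tensors="pt").input_ids
    prompt_embeds = llm_model.get_input_embeddings()(prompt_ids)  # (1, prompt_len, llm_hidden_size)
    prompt_embeds = prompt_embeds.expand(image_tokens.size(0), -1, -1)
    # The language model then generates the caption autoregressively,
    # conditioned on the image tokens followed by the text prompt.
    return torch.cat([image_tokens, prompt_embeds], dim=1)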