metadata

license: mit
language:
  - en
base_model:
  - yuchenxie/CLiP
  - yuchenxie/GPT-2
library_name: transformers
inference: true

GPT-2V Model Card

Model Overview

GPT-2V is a multimodal transformer model that combines CLIP (vision) with GPT-2 (text generation) to produce outputs conditioned on both an image and a text prompt. CLIP supplies an image embedding and GPT-2 supplies per-token hidden states; the two are concatenated and passed through linear projection layers to produce logits over the GPT-2 vocabulary. Note that the merging script below creates these projection layers with freshly initialized weights, so they need to be trained before the model's outputs are meaningful.

Model Architecture

Model Type: arlow_gpt
Base Vision Model: CLIP (yuchenxie/CLiP)
Base Text Model: GPT-2 (yuchenxie/GPT-2)
Config: ArlowGPTConfig, a custom configuration that records the two base model names, projection_dim, and vocab_size.
Tokenizer: GPT-2 Tokenizer
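
As a rough illustration of how the vision and text features are combined, the sketch below mirrors the concatenate-then-project pattern used in the merging script's forward pass, but on random tensors. The helper name fuse_and_project and the sizes (512, 768, and a vocabulary of 1000) are assumptions chosen for illustration, not values read from the checkpoints.

```python
import torch
import torch.nn as nn


def fuse_and_project(image_embeds, text_hidden, feature_projection, output_projection):
    """Broadcast one image vector over every token, concatenate, and project to logits."""
    batch_size, seq_length, _ = text_hidden.shape
    # (batch, image_dim) -> (batch, seq_length, image_dim)
    image_per_token = image_embeds.unsqueeze(1).expand(batch_size, seq_length, -1)
    # (batch, seq_length, image_dim + text_dim)
    combined = torch.cat([image_per_token, text_hidden], dim=-1)
    return output_projection(feature_projection(combined))


# Illustrative sizes only (assumptions, not read from the real checkpoints).
image_dim, text_dim, projection_dim, vocab_size = 512, 768, 768, 1000
feature_projection = nn.Linear(image_dim + text_dim, projection_dim)
output_projection = nn.Linear(projection_dim, vocab_size)

image_embeds = torch.randn(2, image_dim)    # stand-in for CLIP image embeddings
text_hidden = torch.randn(2, 5, text_dim)   # stand-in for GPT-2 hidden states
logits = fuse_and_project(image_embeds, text_hidden, feature_projection, output_projection)
print(logits.shape)
```

Because the same image vector is repeated at every position, each token's logits see the whole image but no spatial detail; this is a property of the design, not of the sketch.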

Key Features

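Multimodal Input: Takes token IDs with an attention mask (text) and pixel values (image) in the same forward pass.
Text Generation: Produces logits over the GPT-2 vocabulary (50,257 tokens by default) at every token position.
Vision-Text Fusion: Concatenates the CLIP image embedding with the GPT-2 hidden state at each token position, then projects the result through two linear layers.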
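
The card does not include a decoding loop. The following is a minimal greedy-decoding sketch, assuming a model whose forward pass accepts input_ids, attention_mask, and pixel_values and returns a dict with a "logits" entry, as ArlowGPT.forward does in the merging script. The greedy_decode helper and the ToyModel stand-in are hypothetical and exist only so the example runs without downloading any weights.

```python
import torch


def greedy_decode(model, input_ids, pixel_values, max_new_tokens=20, eos_token_id=None):
    """Append the arg-max token one step at a time (batch size 1 assumed)."""
    generated = input_ids
    for _ in range(max_new_tokens):
        attention_mask = torch.ones_like(generated)
        with torch.no_grad():
            outputs = model(
                input_ids=generated,
                attention_mask=attention_mask,
                pixel_values=pixel_values,
            )
        # Pick the most likely token at the last position
        next_token = outputs["logits"][:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return generated


class ToyModel(torch.nn.Module):
    """Hypothetical stand-in: always predicts (token + 1) % vocab_size."""

    def __init__(self, vocab_size=10):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, input_ids, attention_mask, pixel_values):
        next_ids = (input_ids + 1) % self.vocab_size
        logits = torch.nn.functional.one_hot(next_ids, self.vocab_size).float()
        return {"logits": logits}


toy_output = greedy_decode(
    ToyModel(), torch.tensor([[3]]), torch.zeros(1, 3, 224, 224), max_new_tokens=4
)
print(toy_output.tolist())
```

With a real model, input_ids would come from the GPT-2 tokenizer and pixel_values from the CLIP processor. The loop re-runs the full sequence at every step, which is simple but slow for long outputs.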

Merging Script

The following script merges the CLIP and GPT-2 checkpoints (the safetensors variants published by Yuchen as yuchenxie/CLiP and yuchenxie/GPT-2) into a single multimodal model, GPT-2V. It saves the combined model together with its configuration, tokenizer, and image processor files so that it can be reloaded easily.

import os
import shutil
from pathlib import Path
from typing import Dict, Optional, Union

import torch
import torch.nn as nn
from transformers import (
    CLIPModel,
    GPT2Model,
    CLIPProcessor,
    GPT2Tokenizer,
    PretrainedConfig,
    PreTrainedModel,
    AutoConfig,
    AutoModelForCausalLM,
)

class ArlowGPTConfig(PretrainedConfig):
    model_type = "arlow_gpt"
    
    def __init__(
        self,
        clip_model_name: str = "yuchenxie/CLiP",
        gpt2_model_name: str = "yuchenxie/GPT-2",
        clip_config: Optional[Dict] = None,
        gpt2_config: Optional[Dict] = None,
        projection_dim: int = 768,
        vocab_size: int = 50257,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.clip_model_name = clip_model_name
        self.gpt2_model_name = gpt2_model_name
        self.clip_config = clip_config
        self.gpt2_config = gpt2_config
        self.projection_dim = projection_dim
        self.vocab_size = vocab_size

class ArlowGPT(PreTrainedModel):
    config_class = ArlowGPTConfig
    
    def __init__(self, config: ArlowGPTConfig):
        super().__init__(config)
        
        # Load the models
        self.clip = CLIPModel.from_pretrained(config.clip_model_name)
        self.gpt2 = GPT2Model.from_pretrained(config.gpt2_model_name)

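        # Projection layers. get_image_features() returns embeddings of size
        # clip.config.projection_dim (not the vision tower's hidden_size), so
        # the concatenated feature width must use that value.
        self.feature_projection = nn.Linear(
            self.clip.config.projection_dim + self.gpt2.config.hidden_size,
            config.projection_dim
        )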

        self.output_projection = nn.Linear(
            config.projection_dim,
            config.vocab_size
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        pixel_values: torch.Tensor,
        labels: Optional[torch.Tensor] = None,
        return_dict: bool = True,
    ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
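        # CLIP image embedding, shape (batch, clip_projection_dim)
        vision_outputs = self.clip.get_image_features(pixel_values=pixel_values)
        # GPT-2 hidden states, shape (batch, seq_length, gpt2_hidden_size)
        text_outputs = self.gpt2(
            input_ids=input_ids,
            attention_mask=attention_mask
        ).last_hidden_state

        batch_size = text_outputs.shape[0]
        seq_length = text_outputs.shape[1]

        # Repeat the single image embedding at every token position
        vision_features = vision_outputs.unsqueeze(1).expand(
            batch_size, seq_length, -1
        )

        # Concatenate along the feature axis, then project to vocabulary logits
        combined_features = torch.cat(
            [vision_features, text_outputs],
            dim=-1
        )

        projected_features = self.feature_projection(combined_features)
        logits = self.output_projection(projected_features)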

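        loss = None
        if labels is not None:
            # Shift so that position t predicts token t + 1 (standard causal LM
            # objective); assumes `labels` is aligned with `input_ids`.
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, self.config.vocab_size),
                shift_labels.view(-1)
            )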

        if return_dict:
            return {
                "loss": loss,
                "logits": logits
            }
        return logits

    @staticmethod
    def register_auto_classes():
        """Register the model with Auto* classes."""
        try:
            AutoConfig.register("arlow_gpt", ArlowGPTConfig)
            AutoModelForCausalLM.register(ArlowGPTConfig, ArlowGPT)
        except ValueError:
            # Already registered
            pass

def save_merged_model(
    model: ArlowGPT,
    output_dir: str,
    model_name: str = "merged_model"
) -> None:
    """Save the merged model with all necessary components in standard format."""
    output_path = Path(output_dir)

    # Remove existing directory if it exists
    if output_path.exists():
        shutil.rmtree(output_path)

    # Create new directory
    output_path.mkdir(parents=True)

    # Register auto classes
    model.register_auto_classes()

    # Save the model
    model.save_pretrained(output_path)

    # Save tokenizer and processor
    tokenizer = GPT2Tokenizer.from_pretrained(model.config.gpt2_model_name)
    processor = CLIPProcessor.from_pretrained(model.config.clip_model_name)

    tokenizer.save_pretrained(output_path)
    processor.save_pretrained(output_path)

def main():
    clip_model = "yuchenxie/CLiP"
    gpt2_model = "yuchenxie/GPT-2"
    output_dir = "merged_model"

    print("Initializing merged model...")
    config = ArlowGPTConfig(
        clip_model_name=clip_model,
        gpt2_model_name=gpt2_model
    )
    model = ArlowGPT(config)

    print("Saving merged model...")
    save_merged_model(model, output_dir)
    print(f"Merged model saved to {output_dir}")
    print("Saved files:")
    for file in os.listdir(output_dir):
        print(f"- {file}")

if __name__ == "__main__":
    main()
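
The script registers the custom classes with the Auto* factories, but that registration only lasts for the current Python process. A fresh process that wants to load the saved directory through an Auto* class has to repeat it first. The sketch below shows the round trip for a config only, using a tiny hypothetical ToyFusionConfig (model type "toy_fusion") in place of ArlowGPTConfig so that it runs without downloading any checkpoints.

```python
import tempfile

from transformers import AutoConfig, PretrainedConfig


class ToyFusionConfig(PretrainedConfig):
    """Hypothetical stand-in for ArlowGPTConfig, kept tiny for illustration."""

    model_type = "toy_fusion"

    def __init__(self, projection_dim: int = 768, **kwargs):
        super().__init__(**kwargs)
        self.projection_dim = projection_dim


# Registration is per Python process; repeat it before any Auto* loading.
try:
    AutoConfig.register("toy_fusion", ToyFusionConfig)
except ValueError:
    pass  # already registered in this process

with tempfile.TemporaryDirectory() as tmp_dir:
    # Writes config.json, including the custom model_type and projection_dim
    ToyFusionConfig(projection_dim=256).save_pretrained(tmp_dir)
    # Resolves model_type "toy_fusion" back to ToyFusionConfig
    reloaded = AutoConfig.from_pretrained(tmp_dir)

print(type(reloaded).__name__, reloaded.projection_dim)
```

For the merged model itself, the analogous steps would presumably be to import ArlowGPTConfig and ArlowGPT from wherever the script is saved, call ArlowGPT.register_auto_classes(), and then load the output directory with AutoModelForCausalLM.from_pretrained. Note that ArlowGPT.__init__ as written also calls from_pretrained on both base checkpoints, so those need to be reachable when the model is constructed.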

License

Use of this model is subject to the same licenses as the original CLIP and GPT-2 models used for merging. Refer to the license terms published by OpenAI and the respective contributors for details.

Citation

If you use GPT-2V in your research or application, please cite the original works of CLIP and GPT-2, along with this model card.