---
license: mit
language:
- en
base_model:
- yuchenxie/CLiP
- yuchenxie/GPT-2
library_name: transformers
---

# GPT-2V Model Card

## Model Overview

**GPT-2V** is a multimodal transformer model that combines **CLIP** (vision) with **GPT-2** (text generation) to produce responses conditioned on both textual and visual inputs. It pairs CLIP's image understanding with GPT-2's language generation, yielding creative, context-aware outputs grounded in both the image and the prompt. The model extends GPT-2's capabilities by incorporating image features through learned projection layers.

### Model Architecture

- **Model Type**: arlow\_gpt
- **Base Vision Model**: CLIP (yuchenxie/CLiP)
- **Base Text Model**: GPT-2 (yuchenxie/GPT-2)
- **Config**: Custom configuration for merging the vision and text modalities.
- **Tokenizer**: GPT-2 tokenizer

### Key Features

- **Multimodal Input**: Accepts both text and an image as input.
- **Text Generation**: Produces creative, context-specific language outputs.
- **Vision-Text Fusion**: Combines features from both vision and text for enhanced generation quality (sketched below).
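### Fusion Sketch

The sketch below illustrates the fusion step only: the single CLIP image embedding is broadcast across the text sequence, concatenated with the GPT-2 token hidden states, and projected to vocabulary logits. The dimensions (512 for the image embedding, 768 for the text hidden states) are illustrative assumptions, and random tensors stand in for real CLIP and GPT-2 outputs; the merging script in the next section shows the actual model.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only -- the real model derives these
# from the loaded CLIP and GPT-2 configurations.
batch_size, seq_length = 2, 8
vision_dim, text_dim, projection_dim, vocab_size = 512, 768, 768, 50257

# Stand-ins for CLIP image features and GPT-2 token hidden states.
image_features = torch.randn(batch_size, vision_dim)          # (B, vision_dim)
token_states = torch.randn(batch_size, seq_length, text_dim)  # (B, T, text_dim)

# Fusion layers: concatenated features -> shared projection -> vocab logits.
feature_projection = nn.Linear(vision_dim + text_dim, projection_dim)
output_projection = nn.Linear(projection_dim, vocab_size)

# Broadcast the image embedding to every text position, then fuse and project.
vision_features = image_features.unsqueeze(1).expand(batch_size, seq_length, -1)
combined = torch.cat([vision_features, token_states], dim=-1)  # (B, T, vision_dim + text_dim)
logits = output_projection(feature_projection(combined))       # (B, T, vocab_size)
print(logits.shape)  # torch.Size([2, 8, 50257])
```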
## Merging Script

The following script merges the **CLIP** and **GPT-2** models (the **safetensors variants published by Yuchen at yuchenxie/CLiP and yuchenxie/GPT-2**) into a single multimodal model, **GPT-2V**. It saves the combined model together with the configuration, tokenizer, and image-processor files needed to reload it.

```python
import os
import shutil
from pathlib import Path
from typing import Dict, Optional, Union

import torch
import torch.nn as nn
from transformers import (
    CLIPModel,
    GPT2Model,
    CLIPProcessor,
    GPT2Tokenizer,
    PretrainedConfig,
    PreTrainedModel,
    AutoConfig,
    AutoModelForCausalLM,
)


class ArlowGPTConfig(PretrainedConfig):
    model_type = "arlow_gpt"

    def __init__(
        self,
        clip_model_name: str = "yuchenxie/CLiP",
        gpt2_model_name: str = "yuchenxie/GPT-2",
        clip_config: Optional[Dict] = None,
        gpt2_config: Optional[Dict] = None,
        projection_dim: int = 768,
        vocab_size: int = 50257,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.clip_model_name = clip_model_name
        self.gpt2_model_name = gpt2_model_name
        self.clip_config = clip_config
        self.gpt2_config = gpt2_config
        self.projection_dim = projection_dim
        self.vocab_size = vocab_size


class ArlowGPT(PreTrainedModel):
    config_class = ArlowGPTConfig

    def __init__(self, config: ArlowGPTConfig):
        super().__init__(config)

        # Load the pretrained vision and text backbones.
        self.clip = CLIPModel.from_pretrained(config.clip_model_name)
        self.gpt2 = GPT2Model.from_pretrained(config.gpt2_model_name)

        # Projection layers. `get_image_features` returns vectors of size
        # `projection_dim` from the CLIP config, so that is the vision width
        # entering the fusion layer.
        self.feature_projection = nn.Linear(
            self.clip.config.projection_dim + self.gpt2.config.hidden_size,
            config.projection_dim
        )
        self.output_projection = nn.Linear(
            config.projection_dim,
            config.vocab_size
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        pixel_values: torch.Tensor,
        labels: Optional[torch.Tensor] = None,
        return_dict: bool = True,
    ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
        vision_outputs = self.clip.get_image_features(pixel_values=pixel_values)
        text_outputs = self.gpt2(
            input_ids=input_ids,
            attention_mask=attention_mask
        ).last_hidden_state

        batch_size = text_outputs.shape[0]
        seq_length = text_outputs.shape[1]

        # Broadcast the single image embedding across the text sequence and
        # fuse it with the token hidden states.
        vision_features = vision_outputs.unsqueeze(1).expand(
            batch_size, seq_length, -1
        )
        combined_features = torch.cat(
            [vision_features, text_outputs], dim=-1
        )

        projected_features = self.feature_projection(combined_features)
        logits = self.output_projection(projected_features)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                logits.view(-1, self.config.vocab_size), labels.view(-1)
            )

        if return_dict:
            return {
                "loss": loss,
                "logits": logits
            }
        return logits

    @staticmethod
    def register_auto_classes():
        """Register the model with Auto* classes."""
        try:
            AutoConfig.register("arlow_gpt", ArlowGPTConfig)
            AutoModelForCausalLM.register(ArlowGPTConfig, ArlowGPT)
        except ValueError:
            # Already registered
            pass


def save_merged_model(
    model: ArlowGPT,
    output_dir: str,
    model_name: str = "merged_model"
) -> None:
    """Save the merged model with all necessary components in standard format."""
    output_path = Path(output_dir)

    # Remove the existing directory if it exists, then recreate it.
    if output_path.exists():
        shutil.rmtree(output_path)
    output_path.mkdir(parents=True)

    # Register auto classes so the saved config maps back to ArlowGPT.
    model.register_auto_classes()

    # Save the merged model weights and config.
    model.save_pretrained(output_path)

    # Save the tokenizer and image processor alongside the model.
    tokenizer = GPT2Tokenizer.from_pretrained(model.config.gpt2_model_name)
    processor = CLIPProcessor.from_pretrained(model.config.clip_model_name)
    tokenizer.save_pretrained(output_path)
    processor.save_pretrained(output_path)


def main():
    clip_model = "yuchenxie/CLiP"
    gpt2_model = "yuchenxie/GPT-2"
    output_dir = "merged_model"

    print("Initializing merged model...")
    config = ArlowGPTConfig(
        clip_model_name=clip_model,
        gpt2_model_name=gpt2_model
    )
    model = ArlowGPT(config)

    print("Saving merged model...")
    save_merged_model(model, output_dir)

    print(f"Merged model saved to {output_dir}")
    print("Saved files:")
    for file in os.listdir(output_dir):
        print(f"- {file}")


if __name__ == "__main__":
    main()
```

## License

Use of this model is subject to the same licensing terms as the original CLIP and GPT-2 models used for merging. Please refer to the license agreements provided by OpenAI and the respective contributors for further details.

## Citation

If you use **GPT-2V** in your research or application, please cite the original CLIP and GPT-2 works, along with this model card.
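## Example Usage

The merging script saves the merged weights together with the GPT-2 tokenizer and CLIP processor, so the output directory can be reloaded for inference. The sketch below is a minimal example, assuming the `ArlowGPT` and `ArlowGPTConfig` classes from the merging script are importable in the current session and that the script saved its output to `merged_model` (the default in `main()`); the image path, prompt, and greedy readout are illustrative only.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, GPT2Tokenizer

# Assumes ArlowGPT / ArlowGPTConfig from the merging script are importable
# and that the merged model was saved to "merged_model".
model = ArlowGPT.from_pretrained("merged_model")
model.eval()

tokenizer = GPT2Tokenizer.from_pretrained("merged_model")
processor = CLIPProcessor.from_pretrained("merged_model")

# Illustrative inputs -- replace "example.jpg" and the prompt with your own.
image = Image.open("example.jpg")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
text_inputs = tokenizer("Describe this image:", return_tensors="pt")

with torch.no_grad():
    outputs = model(
        input_ids=text_inputs.input_ids,
        attention_mask=text_inputs.attention_mask,
        pixel_values=pixel_values,
    )

# outputs["logits"] has shape (batch, sequence_length, vocab_size); a simple
# greedy readout takes the argmax at the last position.
next_token_id = outputs["logits"][:, -1, :].argmax(dim=-1)
print(tokenizer.decode(next_token_id))
```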