---
license: mit
language:
- en
library_name: transformers
---

# Model Card for MMICL

## Temporary Demo for MMICL

[Playground for MMICL-FLANT5XXL](https://60b00a16a2f9f59cc1.gradio.live/)

The demo supports multi-image input as well as video input.

## Model Details

**MMICL (Multi-Modal In-Context Learning)** is a multimodal vision-language model that incorporates BLIP-2/InstructBLIP.

It can analyze and understand multiple images, and it can follow instructions.

### Model Description

MMICL outperforms vision-language models of the same size and performs exceptionally well on complex visual reasoning datasets.

As of 21 August 2023, it achieves **state-of-the-art** performance on both multimodal task leaderboards and a wide range of vision-language tasks.

Furthermore, it showcases new capabilities in video understanding and multimodal in-context learning (M-ICL).

+ **Capability of referring to and reasoning over multiple images**

+ **Manually constructed in-context instruction tuning dataset**

+ As of 21 August 2023, **1st on [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), 1st on [MMBench](https://opencompass.org.cn/leaderboard-multimodal)**

+ Visual Encoder: ViT-L from CLIP / ViT-G/14 from EVA-CLIP

+ Pre-trained LLM: FlanT5-XL / FlanT5-XXL / Vicuna-7B / Vicuna-13B

- **Developed by:** [More Information Needed]

- **License:** MIT

- **Finetuned from model:** [instructblip-flan-t5-xxl](https://huggingface.co/Salesforce/instructblip-flan-t5-xxl)

- **Repository:** [MMICL](https://github.com/HaozheZhao/MIC)

## How to Get Started with the Model

```
# For T5-based models
# The `model` package is assumed to be the one shipped in the MMICL repository,
# so run this from a checkout of that repository.
from PIL import Image
import torch

from model.instructblip import (
    InstructBlipConfig,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)
from model.blip2 import Blip2Config, Blip2ForConditionalGeneration, Blip2Processor

model_type = "instructblip"
model_ckpt = "BleachNick/MMICL-Instructblip-T5-xxl"

# The original snippet passed `config` without defining it. Loading it from the
# checkpoint is an assumption that the repository's config classes expose
# `from_pretrained`.
if 'blip2' in model_type:
    config = Blip2Config.from_pretrained(model_ckpt)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to('cuda:0', dtype=torch.bfloat16)
elif 'instructblip' in model_type:
    config = InstructBlipConfig.from_pretrained(model_ckpt)
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to('cuda:0', dtype=torch.bfloat16)

# Image placeholder token plus one index token per image slot
sp = ["图"] + [f"<image{i}>" for i in range(20)]

processor = InstructBlipProcessor.from_pretrained(model_ckpt)
# For BLIP-2 checkpoints, use this instead:
# processor = Blip2Processor.from_pretrained(model_ckpt)

# Register the image tokens, keeping the tokenizer's remaining special tokens
sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})

# Placeholder paths (not part of the original card): supply one image per
# <imageN> tag used in the prompt.
image_paths = ["image0.png", "image1.png", "image2.png"]
images = [Image.open(path).convert("RGB") for path in image_paths]

prompt = ['Use the image 0: <image0>图,image 1: <image1>图 and image 2: <image2>图 as a visual aid to help you calculate the equation accurately. image 0 is 2+1=3.\nimage 1 is 5+6=11.\nimage 2 is']
prompt = " ".join(prompt)

inputs = processor(images=images, text=prompt, return_tensors="pt")

inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
# One mask entry per image, then add a batch dimension to the image tensor
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

inputs = inputs.to('cuda:0')
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask']
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
```
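
The prompt in the snippet above pairs each image with an index tag and the `图` placeholder (`image 0: <image0>图`, and so on). For a different number of images, a prompt in the same pattern could be assembled as sketched below. The helper names `build_image_refs` and `build_prompt` are hypothetical and not part of the MMICL repository, and the separator layout simply mirrors the example prompt.

```python
IMAGE_PLACEHOLDER = "图"  # placeholder token used in the example prompt


def build_image_refs(num_images: int) -> str:
    """Build the 'image 0: <image0>图,image 1: <image1>图 and ...' fragment."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    refs = [f"image {i}: <image{i}>{IMAGE_PLACEHOLDER}" for i in range(num_images)]
    if num_images == 1:
        return refs[0]
    # Comma-separate all but the last reference, then join the last with "and"
    return ",".join(refs[:-1]) + " and " + refs[-1]


def build_prompt(num_images: int, task: str, context: str = "") -> str:
    """Assemble 'Use the <image refs> <task> <context>' (hypothetical helper)."""
    prompt = f"Use the {build_image_refs(num_images)} {task}"
    if context:
        prompt += " " + context
    return prompt


if __name__ == "__main__":
    print(build_prompt(
        3,
        "as a visual aid to help you calculate the equation accurately.",
        "image 0 is 2+1=3.\nimage 1 is 5+6=11.\nimage 2 is",
    ))
```

The number of `<imageN>` tags in the prompt should match the number of images passed to the processor, since `img_mask` holds one entry per image.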
|
|
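
The special-token step in the snippet above (`sp + processor.tokenizer.additional_special_tokens[len(sp):]`) is easy to misread. It places the 21 image tokens at the front of the list and keeps only the tokenizer's existing special tokens from position 21 onward, so the first 21 existing entries are replaced rather than extended. The sketch below reproduces that list arithmetic in plain Python. The `existing` list is a stand-in modeled on T5-style `<extra_id_N>` sentinels and is an assumption; the real tokenizer's list may differ in content and order. `merge_special_tokens` is a hypothetical name.

```python
def merge_special_tokens(image_tokens, existing_tokens):
    """Mirror `sp + existing[len(sp):]`: image tokens first, then the tail
    of the existing list starting at index len(image_tokens)."""
    return image_tokens + existing_tokens[len(image_tokens):]


# Same construction as in the model card: placeholder plus 20 index tokens
image_tokens = ["图"] + [f"<image{i}>" for i in range(20)]

# Stand-in for tokenizer.additional_special_tokens (assumed, T5-style)
existing = [f"<extra_id_{i}>" for i in range(100)]

merged = merge_special_tokens(image_tokens, existing)

if __name__ == "__main__":
    print(len(image_tokens), len(merged))
    print(merged[20], merged[21])
```

With this stand-in list the total length is unchanged, because the image tokens take over the leading slots instead of being appended.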
|
#### Training Hyperparameters

- **Training regime:** [fp32, bf16 mixed precision, bf16 non-mixed precision]