
Model Summary

BigDocs-Phi-3.5-instruct is a multimodal model trained on BigDocs for document intelligence tasks.

microsoft/Phi-3.5-vision-instruct is used as the base model, and we perform two stages of training (a rough sketch of the per-stage freezing scheme is shown after the list):

  1. Continual Pre-Training (CPT) with BigDocs-CPT, keeping the vision encoder and adapter trainable.
  2. Fine-Tuning (FT) with DocDownstream-1.0, keeping the decoder and adapter trainable.
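
The two stages differ only in which modules receive gradients. Below is a minimal, hypothetical sketch of that freezing scheme; the attribute paths (model.model.vision_embed_tokens, img_processor, img_projection, model.model.layers) are assumptions based on the public Phi-3.5-vision implementation, and the actual BigDocs training code is not reproduced here.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
  "microsoft/Phi-3.5-vision-instruct",
  trust_remote_code=True,
  torch_dtype="auto",
)

def set_stage_trainable(model, stage):
    # Freeze everything, then unfreeze only the modules trained in the given stage.
    for p in model.parameters():
        p.requires_grad = False
    vision = model.model.vision_embed_tokens  # assumed path: image encoder + adapter
    if stage == "cpt":
        # Stage 1 (CPT on BigDocs-CPT): vision encoder and adapter trainable.
        modules = [vision.img_processor, vision.img_projection]  # assumed attribute names
    elif stage == "ft":
        # Stage 2 (FT on DocDownstream-1.0): language decoder and adapter trainable.
        modules = [model.model.layers, vision.img_projection]
    else:
        raise ValueError(f"unknown stage: {stage}")
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True

set_stage_trainable(model, "cpt")  # stage 1: continual pre-training on BigDocs-CPT
# ... CPT training loop ...
set_stage_trainable(model, "ft")   # stage 2: fine-tuning on DocDownstream-1.0
# ... FT training loop ...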

General Document Benchmarks

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them to their base checkpoints, to instruction-tuned variants, and to models trained on [DocStruct4M+DocDownstream]. Across all four base models, the BigDocs-trained checkpoints meet or exceed their DocStruct4M-trained counterparts in average score.

| Model | DocVQA (val) | InfoVQA (val) | DeepForm (test) | KLC (test) | WTQ (test) | TabFact (test) | ChartQA (test) | TextVQA (val) | MMMU (val) | DudeMini (test) | SlideVQA-M (test) | TableVQA (test) | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
| Phi3.5-Vision-4B + BigDocs (Ours) | 87.05 | 70.05 | 70.97 | 37.45 | 51.21 | 81.24 | 81.56 | 68.72 | 45.00 | 36.15 | 32.47 | 67.77 | 60.80 |
| LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
| LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
| LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
| Llama-3.2-90B | 74.15* | 48.71 | 4.18 | 1.81 | 24.20 | 63.01 | 11.36* | 71.69 | 57.78 | 41.24 | 26.09 | 41.57 | 38.82 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 69.10 | 54.55 | 67.58 | 72.87 | 64.62 |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 64.78 | 35.11 | 0.00 | 81.27 | 50.73 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 58.22 | 48.15 | 52.05 | 80.43 | 57.05 |
| Qwen2-VL-72B | 96.50 | 84.50 | 30.45 | 24.78 | 55.63 | 0.00 | 88.30 | 85.50 | 64.50 | 35.87 | 2.15 | 74.23 | 58.40 |

Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct:

Single image:

<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n

Multi-turn conversations:

<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n

For multi-image usage, add multiple image placeholders at the front of the prompt. The <|image_{}|> index should start from 1. One example prompt is shown below:

<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n 
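
Rather than concatenating these special tokens by hand, the processor's chat template can build the same strings. A minimal sketch (the questions and the assistant reply below are illustrative placeholders, not model output):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
  "BigDocs/BigDocs-Phi-3.5-instruct",
  trust_remote_code=True
)

# Multi-turn conversation with one image in the first user turn.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat kind of document is this?"},
    {"role": "assistant", "content": "It is an invoice."},
    {"role": "user", "content": "What is the total amount due?"},
]
prompt = processor.tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True
)
# `prompt` now follows the multi-turn format above and ends with <|assistant|>\n.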

Loading the model locally

After obtaining the BigDocs-Phi-3.5-instruct model checkpoints, users can run inference with the sample code below.

from PIL import Image 
import requests 
from transformers import AutoModelForCausalLM 
from transformers import AutoProcessor 
model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="cuda", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation='flash_attention_2'    
)
# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=4
) 

images = []
placeholder = ""

# Note: if you run out of memory (OOM), consider reducing the number of frames in this example.
for i in range(1,20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg" 
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"
messages = [
    {"role": "user", "content": placeholder+"Summarize the deck of slides."},
]
prompt = processor.tokenizer.apply_chat_template(
  messages, 
  tokenize=False, 
  add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0") 
generation_args = { 
    "max_new_tokens": 1000, 
    "temperature": 0.0, 
    "do_sample": False, 
} 
generate_ids = model.generate(**inputs, 
  eos_token_id=processor.tokenizer.eos_token_id, 
  **generation_args
)

# remove input tokens 
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, 
  skip_special_tokens=True, 
  clean_up_tokenization_spaces=False)[0] 

print(response)
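
For a single document image, the same pipeline applies; as noted in the comment above, num_crops=16 is suggested for single-frame inputs. A minimal variant reusing model, model_id, Image, and requests from the snippet above (the image URL and question are placeholders):

# Single-image variant; reuses objects loaded in the snippet above.
processor_single = AutoProcessor.from_pretrained(model_id,
  trust_remote_code=True,
  num_crops=16  # suggested for single-frame inputs
)

url = "https://example.com/sample_document.png"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is the title of this document?"},
]
prompt = processor_single.tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True
)

inputs = processor_single(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
  eos_token_id=processor_single.tokenizer.eos_token_id,
  max_new_tokens=256
)

# remove input tokens
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor_single.batch_decode(generate_ids, skip_special_tokens=True)[0])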