BigDocs/BigDocs-Phi-3.5-instruct

Model Summary

BigDocs-Phi-3.5-instruct is a multimodal model trained on BigDocs for document intelligence tasks.

We start from microsoft/Phi-3.5-vision-instruct as the base model and train in two stages:

Continual Pre-Training (CPT) with BigDocs-CPT keeping the encoder and adapter trainable.
Fine Tuning (FT) with DocDownstream-1.0 keeping the decoder and adapter trainable.
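
The two stages differ only in which components receive gradient updates. The snippet below is a minimal sketch of that selection logic, not the training code used for this model. The component prefixes `encoder`, `adapter`, and `decoder` are hypothetical names chosen to mirror the wording above; the actual module names in the Phi-3.5-vision checkpoint differ.

```python
# Components kept trainable in each stage, as described above.
# The prefixes are hypothetical labels, not real module names.
TRAINABLE = {
    "cpt": {"encoder", "adapter"},  # Continual Pre-Training
    "ft": {"decoder", "adapter"},   # Fine-Tuning
}


def trainable_flags(param_names, stage):
    """Map each parameter name to True if its top-level component trains in `stage`."""
    active = TRAINABLE[stage]
    return {name: name.split(".", 1)[0] in active for name in param_names}


names = ["encoder.layer0.weight", "adapter.proj.weight", "decoder.layer0.weight"]
for stage in ("cpt", "ft"):
    print(stage, trainable_flags(names, stage))
```

With a PyTorch module, flags like these could be applied by looping over `model.named_parameters()` and setting `requires_grad` on each parameter, assuming the names have first been mapped to the real module prefixes.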

General Document Benchmarks

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them to base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]. In each of the four model families tested, the BigDocs-trained variant reaches a higher average score than its DocStruct4M counterpart.

Model	DocVQA VAL	InfoVQA VAL	DeepForm TEST	KLC TEST	WTQ TEST	TabFact TEST	ChartQA TEST	TextVQA VAL	MMMU VAL	DudeMini TEST	SlideVQA-M TEST	TableVQA TEST	Avg. Score
DocOwl1.5-8B (instruct)	80.73	49.94	68.84	37.99	38.87	79.67	68.56	68.91	33.67	34.64	31.62	52.60	53.84
DocOwl1.5-8B (base)	2.07	1.84	0.00	0.00	0.00	0.00	0.00	0.00	24.44	19.07	3.30	13.63	5.36
DocOwl1.5-8B (base) + DocStruct4M	75.99	46.88	62.77	35.21	32.86	71.56	68.36	65.08	33.67	29.00	27.03	46.27	49.56
DocOwl1.5-8B (base) + BigDocs (Ours)	78.70	47.62	64.39	36.93	35.69	72.65	65.80	67.30	32.33	32.55	29.60	49.03	51.05
Qwen2-VL-2B (instruct)	89.16	64.11	32.38	25.18	38.20	57.21	73.40	79.90	42.00	45.23	46.50	43.07	53.03
Qwen2-VL-2B (base)	7.26	0.78	0.00	0.00	0.00	0.00	0.00	1.14	34.89	28.43	14.55	0.00	7.25
Qwen2-VL-2B (base) + DocStruct4M	59.53	32.00	53.98	36.38	28.48	64.24	54.44	55.89	34.89	28.78	22.68	46.53	43.15
*Qwen2-VL-2B (base) + BigDocs (Ours)	57.23	31.88	49.31	34.39	31.61	64.75	68.60	61.01	35.67	27.19	17.46	47.53	43.89
Phi3.5-Vision-4B (instruct)	86.00	56.20	10.47	7.49	17.18	30.43	82.16	73.12	46.00	37.20	30.93	70.70	45.66
Phi3.5-Vision-4B + DocStruct4M	86.76	68.90	70.12	37.83	51.30	82.12	79.76	68.60	44.11	35.52	31.90	69.17	60.51
Phi3.5-Vision-4B + BigDocs (Ours)	87.05	70.05	70.97	37.45	51.21	81.24	81.56	68.72	45.00	36.15	32.47	67.77	60.80
LLaVA-NeXT-7B (instruct)	63.51	30.90	1.30	5.35	20.06	52.83	52.12	65.10	38.89	17.94	7.46	32.87	32.36
LLaVA-NeXT-7B + DocStruct4M	60.95	26.14	39.78	28.34	25.90	67.72	61.20	52.25	25.78	21.70	15.33	27.03	37.68
LLaVA-NeXT-7B + BigDocs (Ours)	57.13	24.47	46.38	31.09	27.06	72.58	54.72	49.06	17.78	22.88	16.07	33.13	37.70
Llama-3.2-90B	74.15*	48.71	4.18	1.81	24.20	63.01	11.36*	71.69	57.78	41.24	26.09	41.57	38.82
GPT-4o 20240806	92.80	66.37	38.39	29.92	46.63	81.10	85.70	70.46	69.10	54.55	67.58	72.87	64.62
Claude-3.5 Sonnet	88.48	59.05	31.41	24.82	47.13	53.48	51.84	71.42	64.78	35.11	0.00	81.27	50.73
GeminiPro-1.5	91.23	73.94	32.16	24.07	50.29	71.22	34.68	68.16	58.22	48.15	52.05	80.43	57.05
Qwen2-VL-72B	96.50	84.50	30.45	24.78	55.63	0.00	88.30	85.50	64.50	35.87	2.15	74.23	58.40
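
The Avg. Score column appears to be the unweighted mean of the twelve benchmark columns. That reading is an inference from the numbers rather than something stated in the table, and the sketch below checks it for the three Phi3.5-Vision-4B rows using values copied from the table.

```python
from statistics import mean

# Twelve benchmark scores per row, copied from the table (Avg. Score excluded).
rows = {
    "Phi3.5-Vision-4B (instruct)": [
        86.00, 56.20, 10.47, 7.49, 17.18, 30.43,
        82.16, 73.12, 46.00, 37.20, 30.93, 70.70,
    ],
    "Phi3.5-Vision-4B + DocStruct4M": [
        86.76, 68.90, 70.12, 37.83, 51.30, 82.12,
        79.76, 68.60, 44.11, 35.52, 31.90, 69.17,
    ],
    "Phi3.5-Vision-4B + BigDocs (Ours)": [
        87.05, 70.05, 70.97, 37.45, 51.21, 81.24,
        81.56, 68.72, 45.00, 36.15, 32.47, 67.77,
    ],
}

# Unweighted mean over the twelve benchmarks for each row.
averages = {name: mean(scores) for name, scores in rows.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```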

Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct:

Single image:

<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n

Multi-turn conversations:

<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
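
As an illustration of the multi-turn layout, the following sketch assembles the prompt string by hand for a single image. The function name `format_chat` and the `(role, text)` tuple format are assumptions made for this example; the loading example in this card relies on `processor.tokenizer.apply_chat_template` instead.

```python
def format_chat(turns):
    """Build a single-image, multi-turn prompt from (role, text) pairs.

    `turns` must alternate user/assistant, starting and ending with a user turn.
    """
    if not turns or turns[0][0] != "user" or turns[-1][0] != "user":
        raise ValueError("conversation must start and end with a user turn")
    parts = []
    for idx, (role, text) in enumerate(turns):
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role}")
        # The image placeholder goes only in the first user turn.
        image = "<|image_1|>\n" if idx == 0 else ""
        parts.append(f"<|{role}|>\n{image}{text}<|end|>\n")
    # Trailing assistant tag asks the model to generate the next reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_chat([
    ("user", "What is the invoice total?"),
    ("assistant", "The total is listed at the bottom of the page."),
    ("user", "Which currency is it in?"),
])
print(prompt)
```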

For multi-image usage, place one image placeholder per image at the front of the prompt. The <|image_{}|> index starts from 1. An example prompt with four images:

<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
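
The placeholder prefix can be generated rather than typed out. This is a small sketch of that idea; `build_prompt` is a hypothetical helper name, not part of the model's processor API.

```python
def build_prompt(user_text, num_images=1):
    """Return a single-turn prompt with `num_images` placeholders, indexed from 1."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{user_text}<|end|>\n<|assistant|>\n"


print(build_prompt("Compare the four pages.", num_images=4))
```

The images passed to the processor should be in the same order as the placeholders, so that `<|image_1|>` refers to the first image in the list.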

Loading the model locally

After obtaining the BigDocs-Phi-3.5-instruct model checkpoints, the following sample code can be used for inference.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="cuda", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation='flash_attention_2'    
)
# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=4
) 

images = []
placeholder = ""

# Note: if you run out of GPU memory, reduce the number of slides loaded here.
for i in range(1, 20):  # slides 1 through 19
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"
messages = [
    {"role": "user", "content": placeholder+"Summarize the deck of slides."},
]
prompt = processor.tokenizer.apply_chat_template(
  messages, 
  tokenize=False, 
  add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0") 
generation_args = {
    "max_new_tokens": 1000,
    # Greedy decoding; temperature is ignored when do_sample=False, so it is omitted.
    "do_sample": False,
}
generate_ids = model.generate(**inputs, 
  eos_token_id=processor.tokenizer.eos_token_id, 
  **generation_args
)

# Keep only the newly generated tokens by dropping the prompt tokens.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
  skip_special_tokens=True,
  clean_up_tokenization_spaces=False)[0]

print(response)