--- license: mit license_link: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/LICENSE language: - multilingual pipeline_tag: text-generation tags: - nlp - code - vision inference: parameters: temperature: 0.7 widget: - messages: - role: user content: <|image_1|>Can you describe what you see in the image? --- ## Model Summary ### Phi-3 Vision model without Flash-Attention: Phi-3 Vision is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. Resources and Technical Documentation: + [Phi-3 Microsoft Blog](https://aka.ms/Phi-3Build2024) + [Phi-3 Technical Report](https://aka.ms/phi3-tech-report) + [Phi-3 on Azure AI Studio](https://aka.ms/try-phi3vision) + [Phi-3 Cookbook](https://github.com/microsoft/Phi-3CookBook) ```python from PIL import Image import requests from transformers import AutoModelForCausalLM from transformers import AutoProcessor model_id = "Saurabh54/Phi-3-vision-128k-instruct" model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto") processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True) messages = [ {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}, {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."}, {"role": "user", "content": "Provide insightful questions to spark discussion."} ] url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png" image = Image.open(requests.get(url, stream=True).raw) prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0") generation_args = { "max_new_tokens": 500, "temperature": 0.0, "do_sample": False, } generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args) # remove input tokens generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:] response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] print(response) ```