Instructions to use typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct")
model = AutoModelForImageTextToText.from_pretrained("typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
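
The curl call above sends a text-only prompt. Since this is a vision-language model, a request will usually carry an image as well. The sketch below builds such a request in Python, assuming the server accepts the OpenAI-style image_url content part for multimodal models. The helper names build_vision_request and send are hypothetical, and the image URL is the one used in the Transformers example above.

```python
import json
import urllib.request

MODEL_ID = "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct"


def build_vision_request(question, image_url, max_tokens=128):
    """Return an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumption: the server understands the "image_url" content part.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def send(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_vision_request(
    "What animal is on the candy?",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)
print(json.dumps(payload, indent=2))
# With the server running:
# print(send(payload)["choices"][0]["message"]["content"])
```

Building the payload separately from sending it lets you inspect the request without a running server. The same payload should also work against the SGLang server by passing base_url="http://localhost:30000".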

SGLang

How to use typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct with Docker Model Runner:
```
docker model run hf.co/typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct
```

Typhoon2-Vision

Typhoon2-qwen2vl-7b-vision-instruct is a Thai 🇹🇭 vision-language model designed to support both image and video inputs. While Qwen2-VL is built to handle both image and video processing tasks, Typhoon2-VL is specifically optimized for image-based applications.

For the technical report, please see our arXiv paper: https://arxiv.org/abs/2412.13702

Model Description

Here we provide Typhoon2-qwen2vl-7b-vision-instruct which is built upon Qwen2-VL-7B-Instruct.

Model type: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
Requirement: transformers 4.45.0 or newer (the first release that includes Qwen2VLForConditionalGeneration, which the examples below import).
Primary Language(s): Thai 🇹🇭 and English 🇬🇧
Demo: https://vision.opentyphoon.ai/
License: Apache-2.0

Quickstart

The snippet below shows how to use the model with transformers.

Before running the snippet, you need to install the following dependencies:

pip install torch transformers accelerate pillow

How to Get Started with the Model

Use the code below to get started with the model.

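Question: ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย (Identify the name of the place and the country in this image, in Thai.)
Answer: พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย (The Grand Palace, Bangkok, Thailand)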

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Image
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย"},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated answer is decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
# ['พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย']

Processing Multiple Images

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Messages containing multiple images and a text query
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "image",
            },
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]

urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['1. ทั้งสองภาพแสดงสถาปัตยกรรมที่มีลักษณะคล้ายกัน\n2. ทั้งสองภาพมีสีสันที่สวยงาม\n3. ทั้งสองภาพมีทิวทัศน์ที่สวยงาม']
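
In the multi-image example above, the conversation carries one {"type": "image"} placeholder for each image passed to the processor. The following is a minimal sketch of a helper that keeps the two in step. build_conversation is a hypothetical name, not part of transformers, and it only builds the message structure; it does not load the model.

```python
def build_conversation(question, num_images):
    """Build a single-turn conversation with one image placeholder per image."""
    if num_images < 1:
        raise ValueError("expected at least one image")
    # One placeholder per image, followed by the text query.
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


conversation = build_conversation("ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้", num_images=2)
placeholders = sum(part["type"] == "image" for part in conversation[0]["content"])
print(placeholders)  # 2
```

The result can be passed to processor.apply_chat_template in place of the hand-written conversation, with num_images set to len(images).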

Tips

To balance model quality against compute cost, you can set the minimum and maximum number of pixels per image by passing min_pixels and max_pixels to the processor.

min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
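
To make the numbers above concrete, the sketch below works through the arithmetic. It assumes, as in the upstream Qwen2-VL documentation, that each visual token covers roughly a 28 x 28 pixel area, so the settings above correspond to a budget of about 128 to 2560 visual tokens per image. pixel_budget and approx_visual_tokens are hypothetical helpers; the real processor also rounds image sides to a multiple of the patch size, so treat the token count as an estimate.

```python
PATCH_AREA = 28 * 28  # assumed pixel area covered by one visual token


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH_AREA, max_tokens * PATCH_AREA


def approx_visual_tokens(width, height, min_pixels, max_pixels):
    """Estimate visual tokens after the image is resized into the pixel budget."""
    pixels = width * height
    clamped = max(min_pixels, min(pixels, max_pixels))
    return clamped // PATCH_AREA


min_pixels, max_pixels = pixel_budget(128, 2560)
print(min_pixels, max_pixels)  # 100352 2007040
print(approx_visual_tokens(1280, 853, min_pixels, max_pixels))  # 1392
```

Lowering max_pixels reduces the number of visual tokens for large images, which cuts memory use and latency at the cost of fine detail such as small text.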

Evaluation (Image)

Benchmark	Llama-3.2-11B-Vision-Instruct	Qwen2-VL-7B-Instruct	Pathumma-llm-vision-1.0.0	Typhoon2-qwen2vl-7b-vision-instruct
OCRBench (Liu et al., 2024c)	72.84 / 51.10	72.31 / 57.90	32.74 / 25.87	64.38 / 49.60
MMBench (Dev) (Liu et al., 2024b)	76.54 / -	84.10 / -	19.51 / -	83.66 / -
ChartQA (Masry et al., 2022)	13.41 / x	47.45 / 45.00	64.20 / 57.83	75.71 / 72.56
TextVQA (Singh et al., 2019)	32.82 / x	91.40 / 88.70	32.54 / 28.84	91.45 / 88.97
OCR (TH) (OpenThaiGPT, 2024)	64.41 / 35.58	56.47 / 55.34	6.38 / 2.88	64.24 / 63.11
M3Exam Images (TH) (Zhang et al., 2023c)	25.46 / -	32.17 / -	29.01 / -	33.67 / -
GQA (TH) (Hudson et al., 2019)	31.33 / -	34.55 / -	10.20 / -	50.25 / -
MTVQ (TH) (Tang et al., 2024b)	11.21 / 4.31	23.39 / 13.79	7.63 / 1.72	30.59 / 21.55
Average	37.67 / x	54.26 / 53.85	25.61 / 23.67	62.77 / 59.02

Note: The first value in each cell represents Rouge-L. The second value (after /) represents Accuracy, normalized such that Rouge-L = 100%.
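
The card does not say which Rouge-L variant or tokenizer was used for the table above. As a rough illustration of what the metric measures, the sketch below computes a Rouge-L F1 score from the longest common subsequence (LCS) of whitespace-separated tokens. Both function names are hypothetical, and whitespace tokenization is an assumption that would not suit Thai text, which needs a word segmenter.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Rouge-L F1 over whitespace tokens (illustrative tokenization only)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the grand palace in bangkok", "the grand palace bangkok thailand")
print(round(score, 2))  # 0.8
```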

Intended Uses & Limitations

This is an instruction-tuned model that is still under development. It incorporates some guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.

https://twitter.com/opentyphoon

Support

https://discord.gg/us5gAYmrxw

Citation

If you find Typhoon2 useful for your work, please cite it using:

@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}

Model size: 8B params (Safetensors)
Tensor type: BF16

Model tree for typhoon-ai/typhoon2-qwen2vl-7b-vision-instruct

Base model: Qwen/Qwen2-VL-7B
Finetuned: Qwen/Qwen2-VL-7B-Instruct
Finetuned: this model
