---
inference: false
language:
- th
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2-VL-7B-Instruct
---
# **Typhoon2-Vision**
**Typhoon2-qwen2vl-7b-vision-instruct** is a Thai 🇹🇭 vision-language model that supports both image and video inputs. While Qwen2-VL is built to handle both image and video processing tasks, Typhoon2-VL is optimized specifically for image-based applications.
For the technical report, please see our [arXiv paper](https://arxiv.org/abs/2412.13702).
# **Model Description**
Here we provide **Typhoon2-qwen2vl-7b-vision-instruct** which is built upon [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
- **Model type**: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
- **Requirement**: transformers 4.45.0 or newer (the version that introduced Qwen2-VL support).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: [https://vision.opentyphoon.ai/](https://vision.opentyphoon.ai/)
- **License**: Apache-2.0
# **Quickstart**
Below is a code snippet showing how to use the model with transformers.
Before running it, install the following dependencies:
```shell
pip install torch transformers accelerate pillow
```
## How to Get Started with the Model
Use the code below to get started with the model.
<p align="center">
  <img src="https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg" width="80%"/>
</p>
**Question:** ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย *(Identify the name of this place and its country, in Thai)*
**Answer:** พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย *(The Grand Palace, Bangkok, Thailand)*
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)
# Image
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            # "Identify the name of this place and its country, in Thai"
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย"},
        ],
    }
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
# ['พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย']
```
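The list comprehension after `generate` slices the prompt prefix off each returned sequence, so only the newly generated tokens are decoded. In isolation, with hypothetical toy ids standing in for `inputs.input_ids` and the generator's output, the pattern behaves like:

```python
# Hypothetical toy ids: each "output" is its prompt followed by new tokens.
prompt_ids = [[1, 2, 3], [4, 5]]
full_outputs = [[1, 2, 3, 9, 9], [4, 5, 7]]

# Same slicing pattern as the snippet above: drop the prompt prefix.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, full_outputs)]
print(trimmed)  # [[9, 9], [7]]
```

Without this step, `batch_decode` would echo the prompt text back along with the answer.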
### Processing Multiple Images
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)
# Messages containing multiple images and a text query
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "image",
            },
            # "Identify 3 similarities between these two images"
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]
urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['1. ทั้งสองภาพแสดงสถาปัตยกรรมที่มีลักษณะคล้ายกัน\n2. ทั้งสองภาพมีสีสันที่สวยงาม\n3. ทั้งสองภาพมีทิวทัศน์ที่สวยงาม']
# (1. Both images show similar architecture; 2. both images are colorful; 3. both images have beautiful scenery)
```
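Large source images translate into more visual tokens and more GPU memory, so you may want to bound image resolution before handing images to the processor. A minimal sketch (the `cap_resolution` helper and the 1280-pixel limit are illustrative choices, not part of the model's API):

```python
def cap_resolution(size, max_side=1280):
    """Return a (width, height) whose longest edge is at most max_side,
    preserving aspect ratio. Pass the result to PIL's Image.resize."""
    w, h = size
    scale = min(1.0, max_side / max(w, h))
    return (round(w * scale), round(h * scale))

print(cap_resolution((4000, 2000)))  # (1280, 640)
print(cap_resolution((800, 600)))    # unchanged: (800, 600)
```

The processor-level `min_pixels`/`max_pixels` arguments shown in the Tips section below achieve a similar effect without touching the images yourself.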
### Tips
To balance the model's performance against compute cost, you can set a minimum and maximum number of pixels by passing arguments to the processor.
```python
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
```
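The `28 * 28` factor reflects that each visual token in Qwen2-VL covers roughly a 28×28-pixel patch (an assumption based on the model family's patch-merging scheme), so the bounds above can be read as a per-image visual-token budget:

```python
PATCH_AREA = 28 * 28  # pixels covered by one visual token (approximate)

min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28

print(min_pixels // PATCH_AREA)  # at least ~128 tokens per image
print(max_pixels // PATCH_AREA)  # at most ~2560 tokens per image
```

Raising `max_pixels` improves fidelity on detail-heavy inputs (e.g. dense text) at the cost of longer sequences.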
### Evaluation (Image)
| Benchmark | **Llama-3.2-11B-Vision-Instruct** | **Qwen2-VL-7B-Instruct** | **Pathumma-llm-vision-1.0.0** | **Typhoon2-qwen2vl-7b-vision-instruct** |
|-------------------------------------------|-----------------|---------------|---------------|------------------------|
| **OCRBench** [Liu et al., 2024c](#) | **72.84** / 51.10 | 72.31 / **57.90** | 32.74 / 25.87 | 64.38 / 49.60 |
| **MMBench (Dev)** [Liu et al., 2024b](#) | 76.54 / - | **84.10** / - | 19.51 / - | 83.66 / - |
| **ChartQA** [Masry et al., 2022](#) | 13.41 / x | 47.45 / 45.00 | 64.20 / 57.83 | **75.71** / **72.56** |
| **TextVQA** [Singh et al., 2019](#) | 32.82 / x | 91.40 / 88.70 | 32.54 / 28.84 | **91.45** / **88.97** |
| **OCR (TH)** [OpenThaiGPT, 2024](#) | **64.41** / 35.58 | 56.47 / 55.34 | 6.38 / 2.88 | 64.24 / **63.11** |
| **M3Exam Images (TH)** [Zhang et al., 2023c](#) | 25.46 / - | 32.17 / - | 29.01 / - | **33.67** / - |
| **GQA (TH)** [Hudson et al., 2019](#) | 31.33 / - | 34.55 / - | 10.20 / - | **50.25** / - |
| **MTVQ (TH)** [Tang et al., 2024b](#) | 11.21 / 4.31 | 23.39 / 13.79 | 7.63 / 1.72 | **30.59** / **21.55** |
| **Average** | 37.67 / x | 54.26 / 53.85 | 25.61 / 23.67 | **62.77** / **59.02** |
Note: The first value in each cell is **Rouge-L**; the second value (after `/`) is **Accuracy**, normalized such that **Rouge-L = 100%**.
## **Intended Uses & Limitations**
This model is an instructional model. However, it is still under development. It incorporates some level of guardrails, but may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.
## **Follow us**
**https://twitter.com/opentyphoon**
## **Support**
**https://discord.gg/CqyBscMFpg**
## **Citation**
- If you find Typhoon2 useful for your work, please cite it using:
```
@misc{typhoon2,
title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
year={2024},
eprint={2412.13702},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.13702},
}
```