---
inference: false
language:
- th
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2-VL-7B-Instruct
---

# **Typhoon2-Vision**

**Typhoon2-qwen2vl-7b-vision-instruct** is a Thai 🇹🇭 vision-language model that supports both image and video inputs. While the underlying Qwen2-VL handles both image and video processing tasks, Typhoon2-VL is specifically optimized for image-based applications.

For the technical report, please see our [arXiv paper](https://arxiv.org/abs/2412.13702).

# **Model Description**
Here we provide **Typhoon2-qwen2vl-7b-vision-instruct**, which is built upon [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

- **Model type**: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: [https://vision.opentyphoon.ai/](https://vision.opentyphoon.ai/)
- **License**: Apache-2.0

# **Quickstart**

The following code snippet shows how to use the model with `transformers`.

Before running the snippet, you need to install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```
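
To confirm that your environment meets the `transformers >= 4.38.0` requirement noted above, an optional quick check (a minimal sketch) is:

```python
from packaging import version  # installed alongside transformers

import transformers

# The model requires transformers 4.38.0 or newer (see Model Description above).
assert version.parse(transformers.__version__) >= version.parse("4.38.0"), (
    f"transformers {transformers.__version__} is too old; please upgrade."
)
```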

## How to Get Started with the Model


Use the code below to get started with the model.
<p align="center">
    <img src="https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg" width="80%"/>
</p>

**Question:** ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย (Identify the name of this place and its country, in Thai)  
**Answer:** พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย (The Grand Palace, Bangkok, Thailand)

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Image
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย"},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
# ['พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย']
```

### Processing Multiple Images
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Messages containing multiple images and a text query
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "image",
            },
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]

urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['1. ทั้งสองภาพแสดงสถาปัตยกรรมที่มีลักษณะคล้ายกัน\n2. ทั้งสองภาพมีสีสันที่สวยงาม\n3. ทั้งสองภาพมีทิวทัศน์ที่สวยงาม']
```

### Tips
To balance model performance against computational cost, you can set the minimum and maximum number of pixels by passing arguments to the processor.
```python
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
```
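
For reference, Qwen2-VL's dynamic-resolution scheme maps each visual token to roughly a 28 × 28 pixel patch, so the settings above bound each image to approximately 128–2560 visual tokens (about 0.1–2.0 megapixels). Larger budgets tend to help fine-grained tasks such as OCR, at the cost of memory and latency.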

### Evaluation (Image)
| Benchmark                                 | **Llama-3.2-11B-Vision-Instruct**   | **Qwen2-VL-7B-Instruct**   | **Pathumma-llm-vision-1.0.0**   | **Typhoon2-qwen2vl-7b-vision-instruct**   |
|-------------------------------------------|-----------------|---------------|---------------|------------------------|
| **OCRBench** [Liu et al., 2024c](#)        | **72.84** / 51.10   | 72.31 / **57.90** | 32.74 / 25.87 | 64.38 / 49.60          |
| **MMBench (Dev)** [Liu et al., 2024b](#)   | 76.54 / -       | **84.10** / - | 19.51 / -     | 83.66 / -              |
| **ChartQA** [Masry et al., 2022](#)        | 13.41 / x       | 47.45 / 45.00 | 64.20 / 57.83 | **75.71** / **72.56**  |
| **TextVQA** [Singh et al., 2019](#)        | 32.82 / x       | 91.40 / 88.70 | 32.54 / 28.84 | **91.45** / **88.97**  |
| **OCR (TH)** [OpenThaiGPT, 2024](#)        | **64.41** / 35.58   | 56.47 / 55.34 | 6.38 / 2.88   | 64.24 / **63.11**      |
| **M3Exam Images (TH)** [Zhang et al., 2023c](#) | 25.46 / -       | 32.17 / -     | 29.01 / -     | **33.67** / -          |
| **GQA (TH)** [Hudson et al., 2019](#)      | 31.33 / -       | 34.55 / -     | 10.20 / -     | **50.25** / -          |
| **MTVQ (TH)** [Tang et al., 2024b](#)      | 11.21 / 4.31    | 23.39 / 13.79 | 7.63 / 1.72   | **30.59** / **21.55**  |
| **Average**                                | 37.67 / x       | 54.26 / 53.85 | 25.61 / 23.67 | **62.77** / **59.02**      |


Note: The first value in each cell is **Rouge-L**; the second value (after `/`) is **Accuracy**, normalized such that **Rouge-L = 100%**.


## **Intended Uses & Limitations**

This model is an instruction-tuned model; however, it is still under development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.

## **Follow us**

**https://twitter.com/opentyphoon**

## **Support**

**https://discord.gg/CqyBscMFpg**

## **Citation**

- If you find Typhoon2 useful for your work, please cite it using:
```
@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}
```