---
inference: false
language:
- th
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2-VL-7B-Instruct
---

# **Typhoon2-Vision**

**Typhoon2-qwen2vl-7b-vision-instruct** is a Thai 🇹🇭 vision-language model designed to support both image and video inputs. While Qwen2-VL is built to handle both image and video processing tasks, Typhoon2-VL is specifically optimized for image-based applications.

For the technical report, please see our [arXiv paper](https://arxiv.org/abs/2412.13702).

# **Model Description**

Here we provide **Typhoon2-qwen2vl-7b-vision-instruct**, which is built upon [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

- **Model type**: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: [https://vision.opentyphoon.ai/](https://vision.opentyphoon.ai/)
- **License**: Apache-2.0

# **Quickstart**

The code snippet below shows how to use the model with transformers. Before running it, install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```
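If loading the model fails with an import or unrecognized-architecture error, the installed transformers version is usually too old. The snippet below is a minimal sketch (not part of the original quickstart) that checks the requirement stated above before running the examples.

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

# This model card requires transformers 4.38.0 or newer.
REQUIRED = "4.38.0"
if version.parse(transformers.__version__) < version.parse(REQUIRED):
    raise RuntimeError(
        f"transformers {transformers.__version__} found; "
        f"please run `pip install -U transformers` (>= {REQUIRED})."
    )
```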

## How to Get Started with the Model

Use the code below to get started with the model.

**Question:** ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย (Identify the name of the place and the country in this image, in Thai)

**Answer:** พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย (The Grand Palace, Bangkok, Thailand)

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Image
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            # "Identify the name of the place and the country in this image, in Thai"
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย"},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
# ['พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย'] ("The Grand Palace, Bangkok, Thailand")
```

### Processing Multiple Images

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Messages containing multiple images and a text query
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "image",
            },
            # "Identify 3 things that are similar in these two images"
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]

urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['1. ทั้งสองภาพแสดงสถาปัตยกรรมที่มีลักษณะคล้ายกัน\n2. ทั้งสองภาพมีสีสันที่สวยงาม\n3. ทั้งสองภาพมีทิวทัศน์ที่สวยงาม']
# ("1. Both images show architecture with similar characteristics;
#   2. Both images have beautiful colors;
#   3. Both images have beautiful scenery")
```
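On GPUs that support it, inference can be made faster and more memory-efficient by enabling FlashAttention-2 when loading the model. The variant below follows general Qwen2-VL usage in transformers rather than anything specific to this card, and assumes the `flash-attn` package is installed; the rest of the examples above work unchanged.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"

# Optional: bfloat16 weights + FlashAttention-2 (requires `pip install flash-attn`
# and a compatible GPU).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
```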
### Tips

To balance model performance against computational cost, you can set the minimum and maximum number of image pixels by passing arguments to the processor.

```python
# Images are resized so their pixel count falls within [min_pixels, max_pixels].
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
```

### Evaluation (Image)

| Benchmark | **Llama-3.2-11B-Vision-Instruct** | **Qwen2-VL-7B-Instruct** | **Pathumma-llm-vision-1.0.0** | **Typhoon2-qwen2vl-7b-vision-instruct** |
|-------------------------------------------|-----------------|---------------|---------------|------------------------|
| **OCRBench** [Liu et al., 2024c](#) | **72.84** / 51.10 | 72.31 / **57.90** | 32.74 / 25.87 | 64.38 / 49.60 |
| **MMBench (Dev)** [Liu et al., 2024b](#) | 76.54 / - | **84.10** / - | 19.51 / - | 83.66 / - |
| **ChartQA** [Masry et al., 2022](#) | 13.41 / x | 47.45 / 45.00 | 64.20 / 57.83 | **75.71** / **72.56** |
| **TextVQA** [Singh et al., 2019](#) | 32.82 / x | 91.40 / 88.70 | 32.54 / 28.84 | **91.45** / **88.97** |
| **OCR (TH)** [OpenThaiGPT, 2024](#) | **64.41** / 35.58 | 56.47 / 55.34 | 6.38 / 2.88 | 64.24 / **63.11** |
| **M3Exam Images (TH)** [Zhang et al., 2023c](#) | 25.46 / - | 32.17 / - | 29.01 / - | **33.67** / - |
| **GQA (TH)** [Hudson et al., 2019](#) | 31.33 / - | 34.55 / - | 10.20 / - | **50.25** / - |
| **MTVQ (TH)** [Tang et al., 2024b](#) | 11.21 / 4.31 | 23.39 / 13.79 | 7.63 / 1.72 | **30.59** / **21.55** |
| **Average** | 37.67 / x | 54.26 / 53.85 | 25.61 / 23.67 | **62.77** / **59.02** |

Note: The first value in each cell represents **Rouge-L**. The second value (after `/`) represents **Accuracy**, normalized such that **Rouge-L = 100%**.

## **Intended Uses & Limitations**

This is an instruction-tuned model; however, it is still undergoing development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.

## **Follow us**

**https://twitter.com/opentyphoon**

## **Support**

**https://discord.gg/CqyBscMFpg**

## **Citation**

If you find Typhoon2 useful for your work, please cite it using:

```
@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702},
}
```