Qwen3-VL-4B-Instruct-GPTQ-Int4

This version of Qwen3-VL-4B-Instruct has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repository:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

AX650
- AX650N DEMO Board
- M4N-Dock (AXera-Pi Pro, 爱芯派Pro)
- M.2 Accelerator card

Image Process

Chip	Input size	Image num	Image encoder	TTFT (168 tokens)	Decode speed (w4a16)	CMM	Flash
AX650	384*384	1	222 ms	678 ms	7.0 tokens/sec	5.6GiB	5.6GiB

Video Process

Chip	Input size	Image num	Image encoder	TTFT (600 tokens)	Decode speed (w4a16)	CMM	Flash
AX650	384*384	8	773 ms	1887 ms	7.1 tokens/sec	5.6GiB	5.6GiB

Image Process (Image Encoder U8+U16 Quantization)

Chip	Input size	Image num	Image encoder	TTFT (168 tokens)	Decode speed (w4a16)	CMM	Flash
AX650	384*384	1	143 ms	678 ms	7.0 tokens/sec	5.6GiB	5.6GiB

Video Process (Image Encoder U8+U16 Quantization)

Chip	Input size	Image num	Image encoder	TTFT (600 tokens)	Decode speed (w4a16)	CMM	Flash
AX650	384*384	8	498 ms	1887 ms	7.1 tokens/sec	5.6GiB	5.6GiB
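As a rough way to read these tables, the time for a full reply can be estimated as image-encoder time plus TTFT plus the remaining tokens divided by the decode speed. The sketch below assumes that the image-encoder time is not already included in TTFT (the runtime log reports the two values separately); the function name and the 100-token reply length are illustrative and not part of the Axera tooling.

```python
def estimate_latency_s(encoder_ms: float, ttft_ms: float,
                       decode_tps: float, decoded_tokens: int) -> float:
    """Rough reply latency in seconds: encoder + TTFT + decode time.

    Assumption: encoder time is NOT included in TTFT.
    decoded_tokens counts the tokens generated after the first one.
    """
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return (encoder_ms + ttft_ms) / 1000.0 + decoded_tokens / decode_tps


# Figures taken from the first image and video tables above.
image_s = estimate_latency_s(222, 678, 7.0, decoded_tokens=100)
video_s = estimate_latency_s(773, 1887, 7.1, decoded_tokens=100)
print(f"image: {image_s:.1f} s, video: {video_s:.1f} s")
```

For replies of this length the decode phase dominates, so the faster U8+U16 image encoder mainly helps short answers.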

The CMM column is the amount of CMM memory the model consumes. Make sure the CMM allocation on the development board is larger than this value.
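One way to sanity-check the CMM figure on your own board is to compare the "Total CMM" and "Left CMM" lines that the runtime prints during initialization (they appear in the demo logs in this README). The following is a minimal sketch, assuming nothing else allocates CMM between those two log lines; `cmm_used_mb` is a hypothetical helper name, not part of the runtime.

```python
import re


def cmm_used_mb(log_text: str) -> int:
    """Return CMM consumed (MB) as Total CMM minus Left CMM from a runtime log."""
    total = re.search(r"Total CMM:\s*(\d+)\s*MB", log_text)
    left = re.search(r"Left CMM:\s*(\d+)\s*MB", log_text)
    if total is None or left is None:
        raise ValueError("log must contain both 'Total CMM' and 'Left CMM' lines")
    return int(total.group(1)) - int(left.group(1))


# Two lines copied from the image demo log in this README.
sample_log = """\
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 374]: Left CMM:2199 MB
"""

used = cmm_used_mb(sample_log)
print(f"{used} MB = {used / 1024:.1f} GiB")  # 5685 MB = 5.6 GiB
```

The result agrees with the 5.6GiB listed in the tables.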

How to use

Download all files from this repository to the device

If you are using an AX650 board

Prepare tokenizer server

Install transformers

pip install -r requirements.txt

Demo Run

Image understanding demo

Start the tokenizer server for the image understanding demo

python3 tokenizer_images.py --port 8080
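The demo connects to the tokenizer server at http://127.0.0.1:8080 during initialization, so it can help to confirm the port is accepting connections before launching the run script. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and it checks only TCP reachability, not that the server answers tokenizer requests correctly.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8080,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("tokenizer server reachable:", port_is_open())
```

If you start the server with a different `--port` value, pass the same value to the check.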

Run the image understanding demo

input text

描述这张图片 ("Describe this image")

input image

root@ax650 ~/Qwen3-VL-4B-Instruct-GPTQ-Int4 # bash run_image_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  2% | █                                 |   1 /  39 [0.01s<0.58s, 66.67 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.02s<0.37s, 105.26 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [11.33s<11.05s, 3.53 count/s] init vpm axmodel ok,remain_cmm(2199 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:2199 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][                     EncodeImage][ 440]: pixel_values size 1
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 222.440994 ms, size : 1
[I][                          Encode][ 532]: input_ids size:168
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:1, 368640
[I][                          Encode][ 583]: out_embed size:430080
[I][                          Encode][ 584]: input_ids size 168
[I][                          Encode][ 586]: position_ids size:168
[I][                             Run][ 607]: input token num : 168, prefill_split_num : 2
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:40
[I][                             Run][ 865]: ttft: 676.16 ms
这张图片展示了埃及吉萨的金字塔群，背景是晴朗的蓝天，前景是广阔的沙漠。

画面中主要可见三座金字塔：
- 最大的一座是著名的**胡夫金字塔**，它位于画面中央偏左，是三座金字塔中最高、最显眼的。
- 在其右侧，是稍小一些的**卡纳克金字塔**（或称“卡纳克金字塔”）。
- 在画面最左侧，可以看到一座更小的金字塔，可能是**门卡乌金字塔**或**哈夫拉金字塔**。

这三座金字塔都是古埃及法老的陵墓，是古代世界七大奇迹中唯一现存的。它们的结构和规模令人惊叹，体现了古埃及人在建筑、数学和天文学方面的卓越成就。

整个场景在阳光下显得庄严而神秘，是埃及最具代表性的历史遗迹之一。

[N][                             Run][ 992]: hit eos,avg 7.12 token/s
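The sizes printed in the Encode lines of this log can be cross-checked with a little arithmetic. The sketch below assumes a 2x2 spatial merge of the 24x24 patch grid and derives the embedding width from the logged sizes; neither value is stated in the log, so treat both as inferences rather than documented parameters.

```python
# Values copied from the image demo log above.
GRID_H, GRID_W = 24, 24      # "grid_h 24 grid_w 24"
IMG_EMBED_SIZE = 368640      # "img_embed.size:1, 368640"
OUT_EMBED_SIZE = 430080      # "out_embed size:430080"
INPUT_IDS = 168              # "input_ids size:168"

# Assumption (not stated in the log): patches are merged 2x2 before the LLM.
MERGE = 2

tokens_per_image = (GRID_H // MERGE) * (GRID_W // MERGE)
hidden_size = IMG_EMBED_SIZE // tokens_per_image   # inferred embedding width
text_tokens = INPUT_IDS - tokens_per_image         # prompt and template tokens

# Both logged buffer sizes should be token count times the same width.
assert tokens_per_image * hidden_size == IMG_EMBED_SIZE
assert INPUT_IDS * hidden_size == OUT_EMBED_SIZE

print(tokens_per_image, hidden_size, text_tokens)  # 144 2560 24
```

Under these assumptions, one 384*384 image contributes 144 of the 168 input tokens, and the remaining 24 come from the text prompt and chat template.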

Video understanding demo

Start the tokenizer server for the video understanding demo

python tokenizer_video.py --port 8080

Run the video understanding demo

input text

描述这个视频 ("Describe this video")

input video

./video

root@ax650 ~/Qwen3-VL-4B-Instruct-GPTQ-Int4 # bash run_video_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  2% | █                                 |   1 /  39 [0.02s<0.62s, 62.50 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.02s<0.39s, 100.00 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [44.70s<43.58s, 0.89 count/s] init vpm axmodel ok,remain_cmm(2199 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:2199 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 440]: pixel_values size 4
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 773.406006 ms, size : 4
[I][                          Encode][ 532]: input_ids size:600
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:4, 368640
[I][                          Encode][ 574]: offset:159
[I][                          Encode][ 574]: offset:303
[I][                          Encode][ 574]: offset:447
[I][                          Encode][ 583]: out_embed size:1536000
[I][                          Encode][ 584]: input_ids size 600
[I][                          Encode][ 586]: position_ids size:600
[I][                             Run][ 607]: input token num : 600, prefill_split_num : 5
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:88

[I][                             Run][ 865]: ttft: 1886.83 ms
这个视频展示了一群**土拨鼠**（或称“旱獭”）在山间草地上嬉戏打斗的场景。

**画面细节：**

- **主体动物**：画面中有多只土拨鼠，它们毛色以灰、棕、白相间，腹部和四肢颜色较浅，背部较深。它们体型圆润，耳朵短小，表情生动。
- **动作**：这些土拨鼠似乎在进行一场“打斗”或“嬉戏”。它们互相扑腾、跳跃、用前爪拍打、甚至互相“拥抱”或“推搡”。动作非常活跃，充满动感，有些画面甚至有轻微的运动模糊，增强了动态感。
- **背景**：背景是连绵起伏的山峦，山坡上覆盖着绿色植被，远处可见裸露的岩石和山体，天空湛蓝，阳光明媚，说明是白天晴朗的天气。
- **前景**：它们站在一片布满小石子和草的地面，看起来像是山间小径或开阔地。
- **构图**：画面采用近景特写，聚焦于土拨鼠的互动，背景虚化，突出了主体的动态和表情。整体构图充满活力和趣味性。

**风格与氛围：**

- 这张图片/视频具有**拟人化和趣味性**，土拨鼠的动作被夸张化，仿佛在“打斗”或“跳舞”，非常可爱。
- 画面色彩明亮，阳光充足，给人一种**自然、活泼、欢乐**的感觉。

**总结：**

这是一段充满趣味和活力的野生动物短片，展现了土拨鼠在自然环境中的社交行为，它们的“打斗”其实可能是玩耍、争夺领地或建立社交关系的自然行为。整体画面生动、可爱，极具观赏性。

---

**注意**：虽然土拨鼠（旱獭）在野外确实会互相打斗，但这种“打斗”通常是**玩耍或社交行为**，并非真正的攻击。视频中的“打斗”更像是它们的社交互动，非常可爱。

[N][                             Run][ 992]: hit eos,avg 7.10 token/s

prompt >> q
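Both logs show the input being prefilled in chunks of at most 128 tokens (prefill_token_num : 128): 168 tokens run as 128 + 40, and 600 tokens as four chunks of 128 plus one of 88. The sketch below reproduces that split; it illustrates the pattern seen in the logs and is not the runtime's actual implementation.

```python
from typing import List


def split_prefill(num_tokens: int, chunk: int = 128) -> List[int]:
    """Split num_tokens into full chunks plus one final partial chunk."""
    if num_tokens < 0 or chunk <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk must be > 0")
    full, rest = divmod(num_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


print(split_prefill(168))  # [128, 40]
print(split_prefill(600))  # [128, 128, 128, 128, 88]
```

The number of chunks matches the prefill_split_num values in the logs (2 and 5). The logs also report prefill_max_token_num : 1152, so longer inputs would presumably need to be shortened, for example by using fewer frames.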
