DGX SPARK VLLM RESULTS
#4
by RGMC98 - opened
Llama Benchy results with MTP 2
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-9B | pp2048 (c1) | 2620.55 ± 108.88 | 2620.55 ± 108.88 | | | 784.91 ± 33.23 | 783.02 ± 33.23 | 785.00 ± 33.22 |
| Qwen/Qwen3.5-9B | tg128 (c1) | 9.68 ± 0.24 | 9.68 ± 0.24 | 10.67 ± 0.47 | 10.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 (c2) | 3149.17 ± 65.54 | 1577.58 ± 32.55 | | | 1301.05 ± 26.31 | 1299.16 ± 26.31 | 1301.10 ± 26.31 |
| Qwen/Qwen3.5-9B | tg128 (c2) | 18.51 ± 0.40 | 9.79 ± 0.17 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c1) | 3172.55 ± 235.93 | 3172.55 ± 235.93 | | | 1300.38 ± 97.06 | 1298.49 ± 97.06 | 1300.49 ± 97.07 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c1) | 9.92 ± 0.22 | 9.92 ± 0.22 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c1) | 1243.76 ± 6.36 | 1243.76 ± 6.36 | | | 1648.56 ± 8.45 | 1646.67 ± 8.45 | 1648.66 ± 8.45 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c1) | 10.09 ± 0.02 | 10.09 ± 0.02 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c2) | 3883.72 ± 15.81 | 1943.85 ± 7.99 | | | 2109.68 ± 8.86 | 2107.79 ± 8.86 | 2109.73 ± 8.85 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c2) | 19.40 ± 0.59 | 10.05 ± 0.03 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c2) | 1304.04 ± 37.92 | 652.49 ± 18.98 | | | 3143.33 ± 93.21 | 3141.44 ± 93.21 | 3143.39 ± 93.21 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c2) | 18.82 ± 0.91 | 9.91 ± 0.16 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c1) | 3820.75 ± 15.57 | 3820.75 ± 15.57 | | | 2146.18 ± 8.64 | 2144.29 ± 8.64 | 2146.25 ± 8.62 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c1) | 9.97 ± 0.00 | 9.97 ± 0.00 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c1) | 775.86 ± 3.93 | 775.86 ± 3.93 | | | 2641.61 ± 13.35 | 2639.72 ± 13.35 | 2641.66 ± 13.34 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c1) | 9.89 ± 0.01 | 9.89 ± 0.01 | 10.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c2) | 4077.40 ± 3.08 | 2039.80 ± 1.56 | | | 4018.39 ± 2.99 | 4016.50 ± 2.99 | 4018.44 ± 2.99 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c2) | 18.92 ± 0.11 | 9.78 ± 0.11 | 21.33 ± 0.94 | 10.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c2) | 816.36 ± 1.05 | 408.37 ± 0.53 | | | 5016.98 ± 6.46 | 5015.09 ± 6.46 | 5017.04 ± 6.46 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c2) | 18.48 ± 0.65 | 9.72 ± 0.02 | 20.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c1) | 3927.25 ± 5.82 | 3927.25 ± 5.82 | | | 4173.86 ± 6.41 | 4171.97 ± 6.41 | 4173.96 ± 6.40 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c1) | 9.57 ± 0.02 | 9.57 ± 0.02 | 10.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c1) | 436.06 ± 0.63 | 436.06 ± 0.63 | | | 4698.53 ± 6.79 | 4696.64 ± 6.79 | 4698.64 ± 6.77 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c1) | 9.51 ± 0.02 | 9.51 ± 0.02 | 10.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c2) | 4050.36 ± 66.34 | 2036.83 ± 52.49 | | | 8051.44 ± 202.01 | 8049.55 ± 202.01 | 8051.50 ± 202.02 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c2) | 18.58 ± 0.09 | 9.45 ± 0.07 | 20.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c2) | 447.29 ± 5.77 | 224.80 ± 4.71 | | | 9116.12 ± 187.59 | 9114.23 ± 187.59 | 9116.16 ± 187.59 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c2) | 18.20 ± 0.03 | 9.35 ± 0.13 | 20.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c1) | 3906.92 ± 2.15 | 3906.92 ± 2.15 | | | 8389.14 ± 4.70 | 8387.25 ± 4.70 | 8389.22 ± 4.70 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c1) | 8.65 ± 0.02 | 8.65 ± 0.02 | 9.00 ± 0.00 | 9.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c1) | 224.00 ± 1.87 | 224.00 ± 1.87 | | | 9145.26 ± 75.94 | 9143.37 ± 75.94 | 9145.36 ± 75.96 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c1) | 8.61 ± 0.02 | 8.61 ± 0.02 | 9.00 ± 0.00 | 9.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c2) | 3712.80 ± 281.87 | 1874.29 ± 154.97 | | | 17612.52 ± 1538.50 | 17610.63 ± 1538.50 | 17612.58 ± 1538.51 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c2) | 16.27 ± 0.66 | 8.64 ± 0.20 | 18.00 ± 0.00 | 9.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c2) | 216.48 ± 15.33 | 109.29 ± 8.49 | | | 18861.48 ± 1543.33 | 18859.59 ± 1543.33 | 18861.53 ± 1543.32 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c2) | 16.14 ± 0.76 | 8.55 ± 0.25 | 18.00 ± 0.00 | 9.17 ± 0.37 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c1) | 3157.82 ± 119.30 | 3157.82 ± 119.30 | | | 20785.76 ± 803.08 | 20783.87 ± 803.08 | 20785.85 ± 803.07 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c1) | 7.72 ± 0.09 | 7.72 ± 0.09 | 8.33 ± 0.47 | 8.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c1) | 91.90 ± 0.46 | 91.90 ± 0.46 | | | 22286.91 ± 111.65 | 22285.02 ± 111.65 | 22286.99 ± 111.63 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c1) | 7.64 ± 0.09 | 7.64 ± 0.09 | 8.33 ± 0.47 | 8.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c2) | 2286.44 ± 491.33 | 1155.57 ± 254.76 | | | 60172.33 ± 15685.93 | 60170.44 ± 15685.93 | 60175.67 ± 15685.28 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c2) | 7.94 ± 1.11 | 4.42 ± 0.36 | 14.67 ± 0.94 | 7.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c2) | 80.29 ± 2.96 | 40.45 ± 1.46 | | | 50694.61 ± 1874.98 | 50692.72 ± 1874.98 | 50698.34 ± 1874.29 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c2) | 10.07 ± 1.40 | 5.35 ± 0.89 | 15.33 ± 0.94 | 7.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d100000 (c1) | 2407.21 ± 33.79 | 2407.21 ± 33.79 | | | 41552.12 ± 578.07 | 41550.23 ± 578.07 | 41582.82 ± 599.50 |
| Qwen/Qwen3.5-9B | ctx_tg @ d100000 (c1) | 4.31 ± 0.81 | 4.31 ± 0.81 | 7.67 ± 1.25 | 7.67 ± 1.25 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d100000 (c1) | 44.33 ± 4.38 | 44.33 ± 4.38 | | | 46680.51 ± 4856.03 | 46678.62 ± 4856.03 | 46702.83 ± 4849.40 |
| Qwen/Qwen3.5-9B | tg128 @ d100000 (c1) | 5.64 ± 0.49 | 5.64 ± 0.49 | 7.67 ± 0.94 | 7.67 ± 0.94 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d100000 (c2) | 2328.80 ± 182.23 | 1200.93 ± 76.42 | | | 83646.13 ± 5896.07 | 83644.24 ± 5896.07 | 83653.27 ± 5892.10 |
| Qwen/Qwen3.5-9B | ctx_tg @ d100000 (c2) | 5.98 ± 4.08 | 4.48 ± 2.12 | 11.67 ± 4.78 | 7.06 ± 2.53 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d100000 (c2) | 48.26 ± 0.91 | 24.32 ± 0.50 | | | 84264.99 ± 1754.30 | 84263.10 ± 1754.30 | 84269.39 ± 1752.62 |
| Qwen/Qwen3.5-9B | tg128 @ d100000 (c2) | 8.28 ± 1.01 | 4.56 ± 0.57 | 14.67 ± 0.94 | 7.33 ± 0.47 | | | |
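For anyone reading the prompt-processing rows: the est_ppt values (the second of the three millisecond figures in each pp row) appear to be just the prompt length divided by the per-request throughput. The sketch below checks that on a few rows copied from the table. It assumes that `pp2048` means a 2048-token prompt and that `ctx_pp @ dN` means prefilling N context tokens; neither is stated in the thread, and `est_prompt_time_ms` is a name made up for this illustration.

```python
# Sanity check: is est_ppt (ms) roughly prompt_tokens / per-request t/s?
# Assumption (not stated in the thread): "pp2048" is a 2048-token prompt and
# "ctx_pp @ dN" prefills N context tokens.

def est_prompt_time_ms(prompt_tokens: int, tokens_per_s: float) -> float:
    """Time to process a prompt at a given throughput, in milliseconds."""
    return prompt_tokens / tokens_per_s * 1000.0


def relative_error(estimate: float, reported: float) -> float:
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - reported) / reported


# (test name, assumed token count, t/s (req), reported est_ppt in ms),
# with the numbers copied from the benchmark table above.
ROWS = [
    ("pp2048 (c1)", 2048, 2620.55, 783.02),
    ("pp2048 (c2)", 2048, 1577.58, 1299.16),
    ("ctx_pp @ d4096 (c1)", 4096, 3172.55, 1298.49),
    ("ctx_pp @ d8192 (c1)", 8192, 3820.75, 2144.29),
]

for name, tokens, tps, reported in ROWS:
    est = est_prompt_time_ms(tokens, tps)
    err = relative_error(est, reported)
    print(f"{name:22s} est={est:9.1f} ms  reported={reported:9.2f} ms  err={err:.2%}")
```

The small residual differences are what you would expect if the tool averages per-run latencies rather than dividing by the averaged throughput, but that is a guess.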
The generation speed looks quite low: only 9.68 t/s for a single request (tg128, c1) and 18.51 t/s in total at concurrency 2. That seems unreasonably slow for a 9B model that supports linear attention. Does this indicate that there is still room for optimization?
On a system with a large amount of low-bandwidth memory, going with an MoE option is obviously the better choice.
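To put a rough number on the bandwidth argument, here is a back-of-the-envelope sketch of the decode ceiling when every generated token has to stream all active weights from memory once. The inputs are assumptions rather than measurements from this thread: roughly 273 GB/s of memory bandwidth (the published DGX Spark figure), BF16 weights at 2 bytes per parameter, and a hypothetical MoE with 3B active parameters for comparison. It ignores KV-cache reads, activations, and any speedup from MTP, so treat it as an order-of-magnitude guide only.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound model.
# All inputs below are assumptions for illustration, not values from the thread.

def decode_ceiling_tps(active_params_billion: float,
                       bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token reads all active weights once."""
    weight_gb_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb_per_token


BANDWIDTH_GB_S = 273.0   # assumed: published DGX Spark memory bandwidth
BYTES_PER_PARAM = 2.0    # assumed: BF16 weights, no quantization

dense_9b = decode_ceiling_tps(9.0, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe_3b_active = decode_ceiling_tps(3.0, BYTES_PER_PARAM, BANDWIDTH_GB_S)  # hypothetical MoE

print(f"dense 9B ceiling:       {dense_9b:.1f} t/s")
print(f"3B-active MoE ceiling:  {moe_3b_active:.1f} t/s")
```

Under these assumptions the dense 9B ceiling comes out at roughly 15 t/s per request, so the measured ~10 t/s at c1 is not far from what the memory system allows, while a model that only touches about 3B parameters per token would have about three times the headroom.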