Instructions for using olka-fi/Qwen3.5-9B-MXFP4 with libraries and local apps.

Libraries

How to use olka-fi/Qwen3.5-9B-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="olka-fi/Qwen3.5-9B-MXFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so the snippet also produces output when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("olka-fi/Qwen3.5-9B-MXFP4")
model = AutoModelForImageTextToText.from_pretrained("olka-fi/Qwen3.5-9B-MXFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use olka-fi/Qwen3.5-9B-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "olka-fi/Qwen3.5-9B-MXFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/Qwen3.5-9B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/olka-fi/Qwen3.5-9B-MXFP4

SGLang

How to use olka-fi/Qwen3.5-9B-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "olka-fi/Qwen3.5-9B-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/Qwen3.5-9B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "olka-fi/Qwen3.5-9B-MXFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use olka-fi/Qwen3.5-9B-MXFP4 with Docker Model Runner:
```
docker model run hf.co/olka-fi/Qwen3.5-9B-MXFP4
```

Qwen3.5-9B MXFP4

MXFP4-quantized version of Qwen3.5-9B (9B parameters, dense, hybrid Gated DeltaNet + Gated Attention).

Only the MLP weights are quantized to MXFP4 (4-bit microscaling with e8m0 shared exponents, block size 32). All attention, linear attention (Gated DeltaNet), visual encoder, MTP layers, embeddings, and normalization layers remain in BF16.

Metric	Original (BF16)	This model (MXFP4)
Size on disk	19 GB	12 GB
Perplexity (wikitext, 2048 ctx)	8.55	8.30
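
As a rough cross-check of the sizes in the table, the sketch below works out the effective bits per weight of MXFP4 and estimates how many weights were converted. It assumes 1 GB = 10^9 bytes and that the whole 7 GB saving comes from BF16 weights re-encoded as MXFP4. Neither assumption is stated in this card, so the result is a back-of-the-envelope estimate, not a measured figure.

```python
# Back-of-the-envelope estimate; the GB definition and the "all savings come
# from MXFP4 conversion" premise are assumptions, not facts from the card.
BF16_BITS_PER_WEIGHT = 16
# 4-bit element plus one 8-bit e8m0 scale shared by a block of 32 elements
MXFP4_BITS_PER_WEIGHT = 4 + 8 / 32


def estimated_quantized_weights(size_before_gb, size_after_gb):
    """Estimate how many weights moved from BF16 to MXFP4 given the size drop."""
    saved_bits = (size_before_gb - size_after_gb) * 1e9 * 8
    return saved_bits / (BF16_BITS_PER_WEIGHT - MXFP4_BITS_PER_WEIGHT)


estimate = estimated_quantized_weights(19, 12)
print(f"{MXFP4_BITS_PER_WEIGHT} bits per quantized weight")   # 4.25
print(f"~{estimate / 1e9:.1f}B weights quantized (estimate)")  # ~4.8B
```

Under these assumptions, roughly half of the 9B parameters sit in the quantized MLP projections, which is consistent with the rest of the model staying in BF16.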

Model Details

Architecture: Qwen3.5 dense — hybrid Gated DeltaNet + Gated Attention with 32 layers
Parameters: 9B
Context length: 262,144 tokens
Vocabulary: 248,320 tokens

What's quantized

Component	Precision	Notes
MLP gate_proj, up_proj, down_proj	MXFP4 (uint8 packed + e8m0 scales)	2D standard linear weights
Self-attention (Q/K/V/O projections)	BF16	Excluded — preserves attention quality
Linear attention (Gated DeltaNet layers)	BF16	Excluded
Visual encoder	BF16	Excluded
MTP layers	BF16	Excluded
Embeddings, LM head	BF16	Excluded
LayerNorm weights	BF16	1D, not quantizable

Quantization method

Format: MXFP4 — 4-bit float (E2M1) with shared e8m0 block exponent per 32 elements
Scale selection: MSE-optimal over 3 candidate exponents per block (not simple rounding)
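Output format: compressed-tensors with mxfp4-pack-quantized. The quantization format is compatible with stock vLLM; model architecture support is a separate requirement (see the note under Usage).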
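
The format and scale selection described above can be illustrated with a minimal NumPy sketch. It is not the qstream implementation: the candidate set (the standard power-of-two exponent plus its two neighbours) is an assumption, since the card only says that three candidates are compared by MSE. The sketch returns dequantized values and per-block exponents instead of packed uint8 data, resolves rounding ties toward the smaller magnitude, and expects the number of weights to be a multiple of 32.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 32


def fake_quantize_block(block, exponent):
    """Quantize one block with scale 2**exponent and return dequantized values."""
    scale = 2.0 ** exponent
    scaled = np.abs(block) / scale
    # Pick the nearest grid magnitude; values above 6 saturate to 6.
    nearest = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_GRID[nearest] * scale


def quantize_mxfp4(weights):
    """Return (dequantized weights, per-block exponents) for blocks of 32 values."""
    blocks = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK_SIZE)
    dequantized = np.zeros_like(blocks)
    exponents = np.zeros(len(blocks), dtype=np.int32)
    for i, block in enumerate(blocks):
        amax = float(np.abs(block).max())
        if amax == 0.0:
            continue  # an all-zero block stays zero
        # Standard choice: map the block maximum near the top of the E2M1 range.
        base = int(np.floor(np.log2(amax))) - 2
        best_err = None
        # Assumed candidate set: the standard exponent and its two neighbours.
        for exponent in (base - 1, base, base + 1):
            candidate = fake_quantize_block(block, exponent)
            err = float(np.mean((block - candidate) ** 2))
            if best_err is None or err < best_err:
                best_err = err
                exponents[i] = exponent
                dequantized[i] = candidate
    return dequantized.reshape(np.shape(weights)), exponents
```

For example, `quantize_mxfp4(np.random.randn(4, 64))` yields a `(4, 64)` array of dequantized values and 8 block exponents. A real e8m0 scale would store each exponent with a bias of 127 in one byte.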

Usage

vLLM

pip install vllm

vllm serve olka-fi/Qwen3.5-9B-MXFP4 \
    --quantization compressed-tensors \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096

Note: Requires a vLLM build with Qwen3.5 architecture support, which stock vLLM 0.16.0 does not yet include.

Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="olka-fi/Qwen3.5-9B-MXFP4",  # must match the name passed to `vllm serve`
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
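
Since the model also accepts images, the same client can send a multimodal request. The helper below is a small sketch that builds the request arguments in the same shape as the curl payload shown earlier. The function name `build_vision_request` and the `max_tokens` default are illustrative choices, and the model name assumes the server was started with the repository id and no custom served name.

```python
def build_vision_request(prompt, image_url,
                         model="olka-fi/Qwen3.5-9B-MXFP4", max_tokens=128):
    """Build keyword arguments for an OpenAI-style chat completion with one image.

    Hypothetical helper for illustration; it only assembles a dictionary and
    does not contact the server.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

With the `client` from the example above, the request can be sent as `client.chat.completions.create(**build_vision_request("Describe this image in one sentence.", image_url))`, where `image_url` is any image URL the server can reach.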

Quantization Details

Quantized with qstream — custom MXFP4 quantization tool
MSE-optimal 3-candidate scale selection per block (32 elements)
Per-block shared exponent in e8m0 format
Exclude patterns: *self_attn*, *linear_attn*, *lm_head*, *embed_tokens*, *visual*, *mtp*
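
The exclude patterns above are shell-style globs, so the selection of tensors can be sketched with Python's `fnmatch` module. This is an illustration of how such patterns could be applied, not qstream's actual code; the tensor names in the usage note and the rule that only 2D weights are eligible are assumptions based on the table in "What's quantized".

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above.
EXCLUDE_PATTERNS = [
    "*self_attn*", "*linear_attn*", "*lm_head*",
    "*embed_tokens*", "*visual*", "*mtp*",
]


def should_quantize(tensor_name, ndim):
    """Return True if a tensor would be quantized under the assumed rules."""
    if ndim != 2:
        # 1D tensors such as normalization weights are left untouched (assumption).
        return False
    return not any(fnmatchcase(tensor_name, p) for p in EXCLUDE_PATTERNS)
```

With hypothetical tensor names, `should_quantize("model.layers.0.mlp.gate_proj.weight", 2)` returns `True`, while `should_quantize("model.layers.0.self_attn.q_proj.weight", 2)` returns `False`.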

Acknowledgments

Based on Qwen3.5-9B by Tongyi Lab (Alibaba).

Safetensors

Model size: 7B params
Tensor type: F32, BF16

Model tree for olka-fi/Qwen3.5-9B-MXFP4

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Quantized (216): this model