Libraries

How to use AxisQuant/Qwen3.6-27b-gptq-int4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AxisQuant/Qwen3.6-27b-gptq-int4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("AxisQuant/Qwen3.6-27b-gptq-int4")
model = AutoModelForImageTextToText.from_pretrained("AxisQuant/Qwen3.6-27b-gptq-int4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AxisQuant/Qwen3.6-27b-gptq-int4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AxisQuant/Qwen3.6-27b-gptq-int4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AxisQuant/Qwen3.6-27b-gptq-int4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
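
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The helper names `build_chat_request` and `send` are hypothetical (they are not part of vLLM); the URL, port, and JSON body simply mirror the curl call above and assume the server runs with its defaults.

```python
import json
import urllib.request

MODEL_ID = "AxisQuant/Qwen3.6-27b-gptq-int4"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper: it only mirrors the curl example's URL and JSON body.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the generated text in this field.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
# With the server running: print(send(request))
```

For the SGLang server described further down, passing `base_url="http://localhost:30000"` should be the only change needed.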

SGLang

How to use AxisQuant/Qwen3.6-27b-gptq-int4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AxisQuant/Qwen3.6-27b-gptq-int4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AxisQuant/Qwen3.6-27b-gptq-int4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AxisQuant/Qwen3.6-27b-gptq-int4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AxisQuant/Qwen3.6-27b-gptq-int4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AxisQuant/Qwen3.6-27b-gptq-int4 with Docker Model Runner:
```
docker model run hf.co/AxisQuant/Qwen3.6-27b-gptq-int4
```

Author: Prashant Takale

Qwen3.6-27b-gptq-int4

GPTQ INT4 quantization of Qwen/Qwen3.6-27B: ~3.9× less VRAM at load (~54 GB down to ~14 GB) and ~2.4× higher single-stream throughput.

Base model: Qwen/Qwen3.6-27B

Method: GPTQ, 4-bit weight-only quantization
(group size 128, activation-order enabled, symmetric; see the quantization recipe for the full configuration)

Tooling: GPTQModel v7

License: Apache-2.0

Model Compression

| Metric | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
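
The figures in this table can be roughly reproduced from the bit widths alone. The sketch below assumes exactly 27e9 parameters (taken from the model name) and decimal gigabytes; real usage also includes runtime buffers and the BF16 vision encoder, so treat it as a sanity check rather than a measurement.

```python
PARAMS = 27e9  # assumption: nominal parameter count from the "27B" in the model name


def weight_gb(params, bits_per_weight):
    """Weight storage in decimal GB for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)
int4_gb = weight_gb(PARAMS, 4.29)  # effective bits per weight from the table

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"INT4 weights: ~{int4_gb:.1f} GB")
print(f"Bits ratio:   {16 / 4.29:.2f}x")
print(f"VRAM ratio:   {54 / 14:.2f}x")
```

Both estimates land close to the reported ~54 GB and ~14 GB, and the two ratios round to the 3.7× and 3.9× shown in the table.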

Benchmarks

Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.

Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval is served via /v1/completions (raw, no chat template) so that the harness's `\ndef` and `\nclass` stop sequences fire correctly.

| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | `acc_norm` | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | `exact_match` (strict) | 1319 | 96.36 | 96.82 | +0.46 |
| Knowledge | MMLU-Redux | `exact_match` (strict-match) | 1500 | 89.19 | 88.42 | −0.77 |
| STEM Reasoning | GPQA-Diamond | `exact_match` (flexible-extract) | 198 | 71.72 | 68.69 | −3.03 |
| Coding | HumanEval | `pass@1` (create_test) | 164 | 85.98 | 77.44 | −8.54 |
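
To judge how much of each delta could be sampling noise, one rough option is the binomial standard error of a score measured on N items. This is an assumption layered on top of the reported numbers, not part of the original evaluation, and it ignores that both models answer the same items (a paired test would be stricter).

```python
import math

# (task, N, BF16 score %, INT4 score %) copied from the benchmark table.
RESULTS = [
    ("ARC-Challenge", 1172, 63.91, 64.08),
    ("GSM8K", 1319, 96.36, 96.82),
    ("MMLU-Redux", 1500, 89.19, 88.42),
    ("GPQA-Diamond", 198, 71.72, 68.69),
    ("HumanEval", 164, 85.98, 77.44),
]


def binomial_se_pp(score_pct, n):
    """Standard error of a proportion, expressed in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


rows = []
for task, n, bf16, int4 in RESULTS:
    delta = int4 - bf16
    se = binomial_se_pp(int4, n)
    rows.append((task, delta, se))
    print(f"{task:14s} delta={delta:+6.2f} pp   SE(INT4)={se:5.2f} pp")
```

Under this rough view, only the HumanEval drop exceeds one standard error (by a factor of about 2.6); the other four deltas fall within it.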

Inference Performance

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):

| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.91% |
| End-to-end latency (ms) | 1642.66 | 672.70 | −59.05% |

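INT4 delivers ~2.4× higher output throughput and ~2.4× lower end-to-end latency in single-stream serving. The gain comes from the decode phase, where reading 4-bit instead of 16-bit weights cuts memory traffic: time per output token drops ~2.6×, while time to first token is essentially unchanged (−3%).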
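
The latency figures are internally consistent, which can be checked with a short calculation. The sketch assumes the common convention that the first token costs the TTFT and each later token costs one TPOT; the source does not state how the benchmark defines these metrics.

```python
OUTPUT_TOKENS = 42  # output length of the benchmark request

# Milliseconds, copied from the inference performance table.
MEASURED = {
    "BF16": {"ttft": 79.98, "tpot": 38.11, "e2e": 1642.66},
    "INT4": {"ttft": 77.43, "tpot": 14.52, "e2e": 672.70},
}


def estimate_e2e_ms(ttft_ms, tpot_ms, output_tokens):
    """Assumed model: first token costs TTFT, each later token costs one TPOT."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


estimates = {}
for name, m in MEASURED.items():
    estimates[name] = estimate_e2e_ms(m["ttft"], m["tpot"], OUTPUT_TOKENS)
    print(f"{name}: estimated {estimates[name]:.2f} ms, measured {m['e2e']:.2f} ms")

tpot_speedup = MEASURED["BF16"]["tpot"] / MEASURED["INT4"]["tpot"]
e2e_speedup = MEASURED["BF16"]["e2e"] / MEASURED["INT4"]["e2e"]
print(f"TPOT speed-up: {tpot_speedup:.2f}x, end-to-end speed-up: {e2e_speedup:.2f}x")
```

Both estimates land within 1 ms of the measured end-to-end latency, so almost all of the time saved is decode time rather than prefill.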

Quantization recipe

| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| `desc_act` | True (activation-order) |
| `damp_percent` | 0.01 |
| Symmetric | True |
| Calibration | C4 (`en`), 256 samples × 2048 tokens |
| Tool | GPTQModel v7 |
| Effective bits / weight | 4.29 BPW |

The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
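
To illustrate what "4-bit, symmetric, group size 128" means for the stored weights, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is not GPTQ itself: GPTQ additionally uses the calibration data to compensate for rounding error, and real kernels pack the codes differently. The function names and the signed [-8, 7] code range are illustrative assumptions.

```python
import random

GROUP_SIZE = 128    # from the recipe table
QMIN, QMAX = -8, 7  # assumed signed 4-bit code range


def quantize_group(weights):
    """Symmetric 4-bit round-to-nearest for one group; returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row group by group, each group with its own scale."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]  # synthetic weights
quantized = quantize_row(row)
restored = [w for codes, scale in quantized for w in dequantize_group(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(quantized)}, max abs error={max_error:.6f}")
```

Because every group of 128 weights carries its own scale, a single outlier only degrades the precision of its own group, at the cost of a small per-group storage overhead on top of the 4 bits per weight.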

Usage

With GPTQModel (recommended)

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text   = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out    = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

With transformers

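from transformers import AutoModelForCausalLM, AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)

# Generate with the same steps as in the GPTQModel example.
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out    = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))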

Hardware

Weights: 18 GB on disk · ~14 GB VRAM at load
Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended

Limitations

Only the language-model weights are quantized; the vision encoder remains in BF16
Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
Thinking mode (enable_thinking=True) works but is significantly slower — enable only when reasoning quality matters more than latency

License

Apache-2.0, inherited from the base model. See the Qwen/Qwen3.6-27B model page for terms.

Citation

Base model

@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}

Quantization method

@article{frantar2022gptq,
    title   = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
    author  = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
    journal = {arXiv preprint arXiv:2210.17323},
    year    = {2022}
}
