Author: Prashant Takale

Qwen3.6-27b-gptq-int4

GPTQ INT4 quantization of Qwen/Qwen3.6-27B. 3× smaller. ~2.4× higher single-stream throughput than BF16.

Base model: Qwen/Qwen3.6-27B

Method: GPTQ INT4 (weight-only)
(group size 128, desc_act=True, symmetric, damp_percent=0.01)

Tooling: GPTQModel v7

License: Apache-2.0


Model Compression

Memory & storage reduction

| | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
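
The figures above follow from simple arithmetic on parameter count and bits per weight. A minimal sketch, assuming a round 27 billion parameters (the nominal model size, not an exact count) and decimal gigabytes, and ignoring activations, the KV cache, and the BF16 vision encoder:

```python
def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 27e9  # assumption: nominal parameter count of the base model

bf16_gb = weight_gigabytes(NUM_PARAMS, 16)    # BF16 baseline
int4_gb = weight_gigabytes(NUM_PARAMS, 4.29)  # effective bits/weight of this model

print(f"BF16: {bf16_gb:.1f} GB, INT4: {int4_gb:.1f} GB, "
      f"ratio: {bf16_gb / int4_gb:.2f}x")
# prints: BF16: 54.0 GB, INT4: 14.5 GB, ratio: 3.73x
```

The estimate lands on the ~54 GB and ~14 GB load figures in the table; measured VRAM also includes runtime overhead, so treat it as a lower bound.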

Benchmarks

Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.

Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval is served via /v1/completions (raw, no chat template) so that the harness's \ndef / \nclass stop sequences fire correctly.

| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | acc_norm | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | exact_match (strict) | 1319 | 96.36 | 96.82 | +0.46 |
| Knowledge | MMLU-Redux | exact_match (strict-match) | 1500 | 89.19 | 88.42 | −0.77 |
| STEM Reasoning | GPQA-Diamond | exact_match (flexible-extract) | 198 | 71.72 | 68.69 | −3.03 |
| Coding | HumanEval | pass@1 (create_test) | 164 | 85.98 | 77.44 | −8.54 |
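
One rough way to judge which deltas are likely within sampling noise is the binomial standard error of each score. The sketch below treats the two runs as independent samples, which is an assumption: both models see the same items, so a paired test would be tighter. The helper names are illustrative and not part of lm-evaluation-harness.

```python
import math


def stderr_pp(score_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


def delta_in_stderrs(bf16_pct: float, int4_pct: float, n: int) -> float:
    """INT4 minus BF16, in units of the standard error of the difference."""
    se_diff = math.hypot(stderr_pp(bf16_pct, n), stderr_pp(int4_pct, n))
    return (int4_pct - bf16_pct) / se_diff


# (BF16 score, INT4 score, N) copied from the benchmark table.
results = {
    "ARC-Challenge": (63.91, 64.08, 1172),
    "GSM8K": (96.36, 96.82, 1319),
    "MMLU-Redux": (89.19, 88.42, 1500),
    "GPQA-Diamond": (71.72, 68.69, 198),
    "HumanEval": (85.98, 77.44, 164),
}

for task, (bf16, int4, n) in results.items():
    print(f"{task:14s} delta = {int4 - bf16:+.2f} pp "
          f"({delta_in_stderrs(bf16, int4, n):+.2f} standard errors)")
```

Under these assumptions only the HumanEval drop is around two standard errors; the other deltas fall within one.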

Inference Performance

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):

| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.91% |
| End-to-end latency (ms) | 1642.66 | 672.70 | −59.05% |

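INT4 delivers ~2.4× higher output throughput and ~2.4× lower end-to-end latency at single-stream. The gain comes from decoding: time per output token drops about 2.6× (38.11 ms to 14.52 ms), while time to first token is essentially unchanged. Against the 3.7× reduction in bits per weight, most but not all of the bandwidth saving from 4-bit weights carries over to decode speed.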
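
The latency figures in the table are mutually consistent if end-to-end latency is modeled as time to first token plus one TPOT for each remaining output token. This decomposition is a common convention and an assumption here; the benchmark tool's exact definition is not stated.

```python
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Model end-to-end latency as TTFT plus (n - 1) decode steps."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


OUTPUT_TOKENS = 42  # from the request described above

# (TTFT ms, TPOT ms, measured end-to-end ms) copied from the table.
runs = {
    "BF16": (79.98, 38.11, 1642.66),
    "INT4": (77.43, 14.52, 672.70),
}

for name, (ttft, tpot, measured) in runs.items():
    predicted = e2e_latency_ms(ttft, tpot, OUTPUT_TOKENS)
    print(f"{name}: predicted {predicted:.2f} ms, measured {measured:.2f} ms")

tpot_speedup = runs["BF16"][1] / runs["INT4"][1]
print(f"Decode (TPOT) speed-up: {tpot_speedup:.2f}x")
```

Both predictions land within a fraction of a millisecond of the measured values, which shows that almost all of the end-to-end gain comes from the decode phase.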


Quantization recipe

| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| desc_act | True (activation-order) |
| damp_percent | 0.01 |
| Symmetric | True |
| Calibration | C4 (en), 256 samples × 2048 tokens |
| Tool | GPTQModel v7 |
| Effective bits / weight | 4.29 BPW |

The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
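
To make the "4 bits, group size 128, symmetric" settings concrete, here is a minimal pure-Python sketch of group-wise symmetric 4-bit quantization. It uses plain round-to-nearest, which is a simplification: GPTQ itself additionally compensates rounding error using second-order statistics from the calibration data, and the desc_act ordering is omitted. The function names are illustrative and not part of GPTQModel.

```python
import random

BITS = 4
GROUP_SIZE = 128
MAXQ = 2**BITS - 1            # largest 4-bit code: 15
ZERO_POINT = (MAXQ + 1) // 2  # symmetric mode fixes the zero point at 8


def quantize_group(weights):
    """Map one group of floats to 4-bit codes that share a single scale."""
    max_abs = max(abs(w) for w in weights)
    # The range [-max_abs, +max_abs] is spread over the 15 code steps.
    scale = 2 * max_abs / MAXQ if max_abs > 0 else 1.0
    codes = [min(MAXQ, max(0, round(w / scale) + ZERO_POINT)) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floats from the codes and the group scale."""
    return [scale * (q - ZERO_POINT) for q in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]  # toy weight row

for start in range(0, len(row), GROUP_SIZE):
    group = row[start:start + GROUP_SIZE]
    codes, scale = quantize_group(group)
    restored = dequantize_group(codes, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"group {start // GROUP_SIZE}: scale={scale:.5f}, max error={worst:.5f}")
```

Each group stores 128 four-bit codes plus its own scale, which is why the effective bits per weight end up above 4.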


Usage

With GPTQModel (recommended)

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

# Build a chat prompt with thinking mode disabled (the setting used in the benchmarks).
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text   = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens, not the prompt.
out    = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

With transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)

Hardware

  • Weights: 18 GB on disk · ~14 GB VRAM at load
  • Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
  • Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended

Limitations

  • Only the language-model weights are quantized; the vision encoder remains in BF16
  • Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
  • Thinking mode (enable_thinking=True) works but is significantly slower — enable only when reasoning quality matters more than latency

License

Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.


Citation

Base model

@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}

Quantization method

@article{frantar2022gptq,
    title   = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
    author  = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
    journal = {arXiv preprint arXiv:2210.17323},
    year    = {2022}
}