Author: Prashant Takale
Qwen3.6-27b-gptq-int4
GPTQ INT4 quantization of Qwen/Qwen3.6-27B. 3× smaller. ~2.4× faster.
Base model: Qwen/Qwen3.6-27B
Method: GPTQ, 4-bit weight-only (group size 128, desc_act=True, symmetric)
Tooling: GPTQModel v7
License: Apache-2.0
Model Compression
| | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
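The two rows are linked by simple arithmetic: weight memory is roughly parameter count times bits per weight. A minimal sketch, assuming a nominal 27 billion parameters (taken from the model name, not an exact count) and ignoring activations, KV cache, and runtime overhead:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumption: nominal size from the model name

bf16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4.29)

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"INT4 weights: ~{int4_gb:.1f} GB")
print(f"Ratio: {bf16_gb / int4_gb:.1f}x")
```

Under these assumptions the estimate comes out at about 54 GB and 14.5 GB, in line with the table. The 3.9× figure in the VRAM row follows from the rounded ~14 GB value, while the 3.7× figure is the plain 16 / 4.29 ratio.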
Benchmarks
Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.
Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval was served via /v1/completions (raw, no chat template) so that the harness's \ndef / \nclass stop sequences fire correctly.
| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | acc_norm | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | exact_match (strict) | 1319 | 96.36 | 96.82 | +0.46 |
| Knowledge | MMLU-Redux | exact_match (strict-match) | 1500 | 89.19 | 88.42 | −0.77 |
| STEM Reasoning | GPQA-Diamond | exact_match (flexible-extract) | 198 | 71.72 | 68.69 | −3.03 |
| Coding | HumanEval | pass@1 (create_test) | 164 | 85.98 | 77.44 | −8.54 |
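To put the percentage-point deltas in perspective, it can help to convert them into an approximate number of test items. The sketch below recomputes the Δ column from the table and does that conversion, assuming each score is a plain fraction of N; for metrics that may be aggregated across subsets (MMLU-Redux, for instance) the item count is only a rough guide.

```python
# (task, N, BF16 score, INT4 score), copied from the table above
RESULTS = [
    ("ARC-Challenge", 1172, 63.91, 64.08),
    ("GSM8K", 1319, 96.36, 96.82),
    ("MMLU-Redux", 1500, 89.19, 88.42),
    ("GPQA-Diamond", 198, 71.72, 68.69),
    ("HumanEval", 164, 85.98, 77.44),
]


def delta_pp(bf16: float, int4: float) -> float:
    """INT4 minus BF16, in percentage points."""
    return round(int4 - bf16, 2)


def items_changed(n: int, bf16: float, int4: float) -> int:
    """Approximate number of test items behind a percentage-point delta."""
    return round(n * (int4 - bf16) / 100)


for task, n, bf16, int4 in RESULTS:
    print(f"{task:14s} {delta_pp(bf16, int4):+6.2f} pp  ~{items_changed(n, bf16, int4):+d} items")
```

Read this way, the HumanEval drop corresponds to roughly 14 of 164 problems and the GPQA-Diamond drop to roughly 6 of 198 questions.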
Inference Performance
Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):
| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.91% |
| End-to-end latency (ms) | 1642.66 | 672.70 | −59.05% |
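At single-stream, INT4 delivers ~2.4× higher output throughput and ~2.4× lower end-to-end latency, with time per output token dropping about 2.6×. Most of the bandwidth savings from 4-bit weights carries over into decode-time speed-up (output tok/s and TPOT), while time to first token is essentially unchanged.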
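The percentages and speed-up factors can be rechecked directly from the table. A small sketch, using only the measured values above (the dictionary keys are names chosen here for illustration):

```python
BF16 = {"output_tok_s": 25.55, "ttft_ms": 79.98, "tpot_ms": 38.11, "e2e_ms": 1642.66}
INT4 = {"output_tok_s": 62.34, "ttft_ms": 77.43, "tpot_ms": 14.52, "e2e_ms": 672.70}


def pct_change(before: float, after: float) -> float:
    """Relative change from BF16 to INT4, in percent."""
    return (after - before) / before * 100


def speedup(metric: str) -> float:
    """Speed-up of INT4 over BF16: times are inverted so that >1 always means faster."""
    if metric.endswith("_ms"):
        return BF16[metric] / INT4[metric]
    return INT4[metric] / BF16[metric]


for m in BF16:
    print(f"{m:13s} {pct_change(BF16[m], INT4[m]):+8.2f}%  {speedup(m):.2f}x")
```

Throughput and end-to-end latency both come out at about 2.44×, time per output token at about 2.62×, and time to first token at about 1.03×.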
Quantization recipe
| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| desc_act | True (activation-order) |
| damp_percent | 0.01 |
| Symmetric | True |
| Calibration | C4 (en), 256 samples × 2048 tokens |
| Tool | GPTQModel v7 |
| Effective bits / weight | 4.29 BPW |
The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
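Why the effective figure sits above 4 bits can be illustrated with the per-group overhead of grouped quantization. The sketch below assumes a 16-bit scale and a 4-bit packed zero point per group of 128 weights, which is a common GPTQ checkpoint layout but is not confirmed by this card:

```python
def gptq_storage_bits_per_weight(
    bits: int = 4,
    group_size: int = 128,
    scale_bits: int = 16,  # assumption: one FP16 scale per group
    zero_bits: int = 4,    # assumption: one packed 4-bit zero point per group
) -> float:
    """Bits stored per quantized weight, counting per-group scale and zero point."""
    return bits + (scale_bits + zero_bits) / group_size


print(gptq_storage_bits_per_weight())  # 4.15625
```

Under these assumptions the quantized layers alone cost about 4.16 bits per weight. The card does not break down the reported 4.29 BPW; the remainder would have to come from other stored tensors, for example the BF16 vision encoder or activation-order indices, but that attribution is a guess.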
Usage
With GPTQModel (recommended)
```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)
```
Hardware
- Weights: 18 GB on disk · ~14 GB VRAM at load
- Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
- Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended
Limitations
- Only the language-model weights are quantized; the vision encoder remains in BF16
- Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
- Thinking mode (enable_thinking=True) works but is significantly slower; enable it only when reasoning quality matters more than latency
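For the calibration limitation, re-quantizing on a matching corpus could look roughly like the sketch below. It reuses the settings from the recipe table and the load / quantize / save flow of GPTQModel; the `QuantizeConfig` field names are assumed to match the installed GPTQModel version and should be checked against it. The helper names, the corpus, and the output directory are hypothetical.

```python
# Settings copied from the quantization recipe table.
RECIPE = {"bits": 4, "group_size": 128, "desc_act": True, "sym": True, "damp_percent": 0.01}


def pick_calibration_texts(texts, n_samples=256):
    """Keep the first n_samples non-empty documents from a domain corpus."""
    picked = [t for t in texts if t and t.strip()]
    return picked[:n_samples]


def requantize(base_model_id, calibration_texts, out_dir):
    """Quantize the base model on custom calibration text (needs a GPU and gptqmodel)."""
    # Imported lazily so the helper above works without gptqmodel installed.
    from gptqmodel import GPTQModel, QuantizeConfig

    # Assumption: these field names are accepted by your GPTQModel version.
    config = QuantizeConfig(**RECIPE)
    model = GPTQModel.load(base_model_id, config)
    model.quantize(calibration_texts, batch_size=1)
    model.save(out_dir)
```

A call such as `requantize("Qwen/Qwen3.6-27B", pick_calibration_texts(my_corpus), "qwen3.6-27b-gptq-int4-custom")` would then produce a checkpoint calibrated on `my_corpus`. The sketch does not reproduce the 2048-token sample length used for this model, so sample preparation would need to be adapted.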
License
Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.
Citation
Base model
@misc{qwen3.6-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
month = {April},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
Quantization method
@article{frantar2022gptq,
title = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
journal = {arXiv preprint arXiv:2210.17323},
year = {2022}
}

