Phi-4-reasoning AWQ 4-bit Quantized
This is a 4-bit AWQ quantized version of microsoft/Phi-4-reasoning.
Model Description
- Base Model: Phi-4-reasoning (14B parameters)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Precision: 4-bit
- Group Size: 128
- Original Size: ~28 GB (FP16)
- Quantized Size: ~7 GB
- Memory Reduction: ~75%
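For reference, these settings correspond to a standard AutoAWQ workflow. Below is a minimal sketch of how a checkpoint like this can be produced; the exact quantization script and calibration data used for this release are not documented, so treat the defaults here as assumptions.
# Sketch: quantizing the FP16 base model with AutoAWQ (illustrative, not the exact script used)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "microsoft/Phi-4-reasoning"
quant_path = "phi-4-reasoning-awq"

# 4-bit weights, group size 128 (matches the settings listed above)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# AutoAWQ's default calibration dataset is used here; the calibration set for
# this release is not specified in the card.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)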
About Phi-4-reasoning
Phi-4-reasoning is Microsoft's specialized reasoning model that excels at:
- Step-by-step mathematical reasoning
- Logical deduction and inference
- Code understanding and debugging
- Complex problem solving
- Chain-of-thought reasoning
Released in 2025, this model builds on the Phi-4 architecture with enhanced reasoning capabilities.
Key Findings:
- 6.9x faster inference with AWQ quantization
- Maintains quality with minimal perplexity degradation
- Best performance on code reasoning (56.7% accuracy)
- ~75% memory reduction (28 GB → 7 GB)
Usage
Using Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch
model_id = "ronantakizawa/phi-4-reasoning-awq"
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,
    do_fuse=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config,
)
# Reasoning task
prompt = "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
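Phi-4-reasoning is a chat-tuned model, so prompts are typically passed through the tokenizer's chat template rather than as raw text. A hedged variant of the example above (same generation settings, template applied by the tokenizer):
# Sketch: prompting via the chat template (assumes model/tokenizer loaded as above)
messages = [
    {"role": "user", "content": "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))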
Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_id = "ronantakizawa/phi-4-reasoning-awq"
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Generate
prompt = "Explain the logic: All dogs are mammals. All mammals are animals. Therefore..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Installation
pip install autoawq transformers accelerate
Requirements
- GPU Memory: ~8-10 GB VRAM (runs on RTX 3090, RTX 4090, A100, etc.)
- CUDA: Required for AWQ
- Python: 3.8+
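A quick way to sanity-check the environment before loading the model (assumes PyTorch with CUDA support is installed):
# Sketch: verify a CUDA GPU with enough free VRAM is available
import torch

assert torch.cuda.is_available(), "AWQ inference requires a CUDA GPU"
free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB (~8-10 GB needed)")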
Performance
- Memory Usage: ~75% reduction vs FP16
- Inference Speed: 6.9x faster than FP16 baseline
- Quality: 111.7% score retention - maintains or exceeds baseline quality
- Use Cases: Well suited for reasoning tasks on consumer GPUs (see the throughput sketch below)
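To get a rough feel for throughput on your own hardware (this is not the exact benchmark harness behind the numbers above), a simple timing run over the loaded model works:
# Sketch: rough tokens/sec measurement (assumes model/tokenizer from the usage section)
import time
import torch

prompt = "Solve step-by-step: What is 15% of 240?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec over {elapsed:.2f}s")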
Evaluation Methodology
Tested on 11 reasoning tasks across 4 categories:
- Mathematical Reasoning (3 tests): Area/perimeter, percentages, word problems
- Logical Reasoning (3 tests): Syllogisms, logical fallacies, deductive reasoning
- Code Reasoning (3 tests): Bug detection, code comprehension, efficiency analysis
- Chain of Thought (2 tests): Multi-step problem solving, angle calculations
Evaluation metrics:
- Accuracy: Keyword-based scoring against expected outputs
- Latency: Time per inference (deterministic generation)
- Score Retention: (Quantized Score / Baseline Score) × 100%
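As an illustration of these metrics, a minimal version of the keyword-based scoring and the retention formula might look like the sketch below (the actual evaluation script may differ; the example numbers are chosen only to reproduce the reported 111.7%):
# Sketch: keyword-based accuracy and score retention (illustrative only)
def keyword_accuracy(output: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the model output."""
    text = output.lower()
    return sum(kw.lower() in text for kw in expected_keywords) / len(expected_keywords)

def score_retention(quantized_score: float, baseline_score: float) -> float:
    """Score Retention = (Quantized Score / Baseline Score) x 100%."""
    return quantized_score / baseline_score * 100.0

# Hypothetical scores picked to match the reported retention figure
print(round(score_retention(0.67, 0.60), 1))  # 111.7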
Limitations
- Requires CUDA GPU (no CPU support for AWQ)
- Some complex chain-of-thought prompts may need optimization
- Calibration-dependent (quality depends on calibration data)
- Performance on specific reasoning tasks varies (see benchmarks)
License
MIT (inherited from base model)
Citation
@misc{phi-4-reasoning-awq,
  author = {Ronan Takizawa},
  title = {Phi-4-reasoning AWQ 4-bit Quantized},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ronantakizawa/phi-4-reasoning-awq}}
}
Base Model Citation
Please refer to the original model card for the base model citation.
Acknowledgments
- Microsoft for the Phi-4-reasoning model
- MIT HAN Lab for the AWQ quantization method
- Casper Hansen and the AutoAWQ team
Repository: github.com/ronantakizawa/phi4-reasoning-awq