Phi-4-reasoning AWQ 4-bit Quantized

This is a 4-bit AWQ quantized version of microsoft/Phi-4-reasoning.

Model Description

  • Base Model: Phi-4-reasoning (14B parameters)
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Quantization Precision: 4-bit
  • Group Size: 128 (a reproduction sketch with AutoAWQ follows this list)
  • Original Size: ~28 GB (FP16)
  • Quantized Size: ~7 GB
  • Memory Reduction: ~75%
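
If you want to reproduce a checkpoint like this one, the snippet below is a minimal sketch of the AutoAWQ quantization API using the settings listed above (4-bit weights, group size 128). The output directory name and the use of AutoAWQ's default calibration set are assumptions; the exact calibration data used for this release is not documented here.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "microsoft/Phi-4-reasoning"

# Load the FP16 base model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(base_id, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)

# 4-bit weights, group size 128, GEMM kernels (matches the settings above)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# AutoAWQ falls back to its default calibration dataset when none is passed
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("phi-4-reasoning-awq")
tokenizer.save_pretrained("phi-4-reasoning-awq")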

About Phi-4-reasoning

Phi-4-reasoning is Microsoft's specialized reasoning model that excels at:

  • ✅ Step-by-step mathematical reasoning
  • ✅ Logical deduction and inference
  • ✅ Code understanding and debugging
  • ✅ Complex problem solving
  • ✅ Chain-of-thought reasoning

Released in 2025, this model builds on the Phi-4 architecture with enhanced reasoning capabilities.

Key Findings:

  • ⚡ 6.9x faster inference with AWQ quantization
  • ✅ Maintains quality - minimal perplexity degradation vs. FP16
  • 🎯 Best performance on code reasoning (56.7% accuracy)
  • 💾 ~75% memory reduction (28 GB → 7 GB)

Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch

model_id = "ronantakizawa/phi-4-reasoning-awq"

# Optional fused modules: faster decoding, but sequences are capped at fuse_max_seq_len
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config
)

# Reasoning task
prompt = "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
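
Phi-4-reasoning is chat-tuned, so raw text prompts work but the chat template generally produces better-structured reasoning. The snippet below is a minimal sketch that reuses the model and tokenizer loaded above and assumes the quantized tokenizer ships the base model's chat template.

messages = [
    {"role": "user", "content": "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"}
]

# Build the prompt in the chat format the model was tuned on
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))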

Using AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "ronantakizawa/phi-4-reasoning-awq"

# Load the pre-quantized weights; fuse_layers enables AutoAWQ's fused kernels for faster decoding
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate
prompt = "Explain the logic: All dogs are mammals. All mammals are animals. Therefore..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Installation

pip install autoawq transformers accelerate

Requirements

  • GPU Memory: ~8-10 GB VRAM (runs on RTX 3090, RTX 4090, A100, etc.; see the VRAM check below)
  • CUDA: Required for AWQ
  • Python: 3.8+
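
Before loading, a quick VRAM check can save a failed load. This is a minimal sketch using PyTorch's CUDA utilities; the 8-10 GiB figure is the estimate from the list above.

import torch

if not torch.cuda.is_available():
    raise RuntimeError("AWQ kernels require a CUDA GPU")

# Total memory on GPU 0, in GiB
total_gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
print(f"GPU 0: {total_gib:.1f} GiB VRAM (this model needs roughly 8-10 GiB)")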

Performance

  • Memory Usage: ~75% reduction vs FP16
  • Inference Speed: 6.9x faster than FP16 baseline
  • Quality: 111.7% score retention - matches or exceeds the FP16 baseline on the evaluation below
  • Use Cases: Well suited for reasoning tasks on consumer GPUs

Evaluation Methodology

Tested on 11 reasoning tasks across 4 categories:

  • Mathematical Reasoning (3 tests): Area/perimeter, percentages, word problems
  • Logical Reasoning (3 tests): Syllogisms, logical fallacies, deductive reasoning
  • Code Reasoning (3 tests): Bug detection, code comprehension, efficiency analysis
  • Chain of Thought (2 tests): Multi-step problem solving, angle calculations

Evaluation metrics (a minimal scoring sketch follows this list):

  • Accuracy: Keyword-based scoring against expected outputs
  • Latency: Time per inference (deterministic generation)
  • Score Retention: (Quantized Score / Baseline Score) × 100%
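
The evaluation harness itself is not reproduced here; the sketch below only illustrates the two scoring formulas described above, with hypothetical keyword lists and illustrative scores.

def keyword_score(output, expected_keywords):
    """Fraction of expected keywords found in the output (case-insensitive substring match)."""
    text = output.lower()
    return sum(kw.lower() in text for kw in expected_keywords) / len(expected_keywords)

def score_retention(quantized_score, baseline_score):
    """Score Retention = (Quantized Score / Baseline Score) x 100%."""
    return quantized_score / baseline_score * 100

# Hypothetical example using the word problem from the usage section
output = "The average speed is 120 / 2 = 60 miles per hour."
print(keyword_score(output, ["60", "miles per hour"]))  # 1.0
print(score_retention(0.90, 0.80))                      # 112.5 (illustrative numbers only)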

Limitations

  • Requires CUDA GPU (no CPU support for AWQ)
  • Some complex chain-of-thought prompts may need optimization
  • Calibration-dependent (quality depends on calibration data)
  • Performance on specific reasoning tasks varies (see benchmarks)

License

MIT (inherited from base model)

Citation

@misc{phi-4-reasoning-awq,
  author = {Ronan Takizawa},
  title = {Phi-4-reasoning AWQ 4-bit Quantized},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ronantakizawa/phi-4-reasoning-awq}}
}

Base Model Citation

Please refer to the original model card for the base model citation.

Acknowledgments

  • Microsoft for the Phi-4-reasoning model
  • MIT HAN Lab for the AWQ quantization method
  • Casper Hansen and the AutoAWQ team

Repository: github.com/ronantakizawa/phi4-reasoning-awq
