Phi-4-reasoning AWQ 4-bit Quantized

This is a 4-bit AWQ quantized version of microsoft/Phi-4-reasoning.

Model Description

  • Base Model: Phi-4-reasoning (14B parameters)
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Quantization Precision: 4-bit
  • Group Size: 128 (a reproduction sketch with AutoAWQ follows this list)
  • Original Size: ~28 GB (FP16)
  • Quantized Size: ~7 GB
  • Memory Reduction: ~75%
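
If you want to reproduce a checkpoint like this one, the snippet below is a minimal sketch of the AutoAWQ quantization API using the settings listed above (4-bit weights, group size 128). The output directory name and the use of AutoAWQ's default calibration set are assumptions; the exact calibration data used for this release is not documented here.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "microsoft/Phi-4-reasoning"

# Load the FP16 base model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(base_id, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)

# 4-bit weights, group size 128, GEMM kernels (matches the settings above)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# AutoAWQ falls back to its default calibration dataset when none is passed
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("phi-4-reasoning-awq")
tokenizer.save_pretrained("phi-4-reasoning-awq")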

About Phi-4-reasoning

Phi-4-reasoning is Microsoft's specialized reasoning model that excels at:

  • ✅ Step-by-step mathematical reasoning
  • ✅ Logical deduction and inference
  • ✅ Code understanding and debugging
  • ✅ Complex problem solving
  • ✅ Chain-of-thought reasoning

Released in 2025, this model builds on the Phi-4 architecture with enhanced reasoning capabilities.

Key Findings:

  • ⚡ 6.9x faster inference with AWQ quantization
  • ✅ Maintains quality - minimal perplexity degradation vs. FP16
  • 🎯 Best performance on code reasoning (56.7% accuracy)
  • 💾 ~75% memory reduction (28 GB → 7 GB)

Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch

model_id = "ronantakizawa/phi-4-reasoning-awq"

# Optional fused modules: faster decoding, but sequences are capped at fuse_max_seq_len
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config
)

# Reasoning task
prompt = "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
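
Phi-4-reasoning is chat-tuned, so raw text prompts work but the chat template generally produces better-structured reasoning. The snippet below is a minimal sketch that reuses the model and tokenizer loaded above and assumes the quantized tokenizer ships the base model's chat template.

messages = [
    {"role": "user", "content": "Solve step-by-step: If a train travels 120 miles in 2 hours, what is its average speed?"}
]

# Build the prompt in the chat format the model was tuned on
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))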

Using AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "ronantakizawa/phi-4-reasoning-awq"

# Load the pre-quantized weights; fuse_layers enables AutoAWQ's fused kernels for faster decoding
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate
prompt = "Explain the logic: All dogs are mammals. All mammals are animals. Therefore..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Installation

pip install autoawq transformers accelerate

Requirements

  • GPU Memory: ~8-10 GB VRAM (runs on RTX 3090, RTX 4090, A100, etc.; see the VRAM check below)
  • CUDA: Required for AWQ
  • Python: 3.8+
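
Before loading, a quick VRAM check can save a failed load. This is a minimal sketch using PyTorch's CUDA utilities; the 8-10 GiB figure is the estimate from the list above.

import torch

if not torch.cuda.is_available():
    raise RuntimeError("AWQ kernels require a CUDA GPU")

# Total memory on GPU 0, in GiB
total_gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
print(f"GPU 0: {total_gib:.1f} GiB VRAM (this model needs roughly 8-10 GiB)")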

Performance

  • Memory Usage: ~75% reduction vs FP16
  • Inference Speed: 6.9x faster than FP16 baseline
  • Quality: 111.7% score retention - matches or exceeds the FP16 baseline on the evaluation below
  • Use Cases: Well suited for reasoning tasks on consumer GPUs

Evaluation Methodology

Tested on 11 reasoning tasks across 4 categories:

  • Mathematical Reasoning (3 tests): Area/perimeter, percentages, word problems
  • Logical Reasoning (3 tests): Syllogisms, logical fallacies, deductive reasoning
  • Code Reasoning (3 tests): Bug detection, code comprehension, efficiency analysis
  • Chain of Thought (2 tests): Multi-step problem solving, angle calculations

Evaluation metrics (a minimal scoring sketch follows this list):

  • Accuracy: Keyword-based scoring against expected outputs
  • Latency: Time per inference (deterministic generation)
  • Score Retention: (Quantized Score / Baseline Score) × 100%
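
The evaluation harness itself is not reproduced here; the sketch below only illustrates the two scoring formulas described above, with hypothetical keyword lists and illustrative scores.

def keyword_score(output, expected_keywords):
    """Fraction of expected keywords found in the output (case-insensitive substring match)."""
    text = output.lower()
    return sum(kw.lower() in text for kw in expected_keywords) / len(expected_keywords)

def score_retention(quantized_score, baseline_score):
    """Score Retention = (Quantized Score / Baseline Score) x 100%."""
    return quantized_score / baseline_score * 100

# Hypothetical example using the word problem from the usage section
output = "The average speed is 120 / 2 = 60 miles per hour."
print(keyword_score(output, ["60", "miles per hour"]))  # 1.0
print(score_retention(0.90, 0.80))                      # 112.5 (illustrative numbers only)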

Limitations

  • Requires CUDA GPU (no CPU support for AWQ)
  • Some complex chain-of-thought prompts may need optimization
  • Calibration-dependent (quality depends on calibration data)
  • Performance on specific reasoning tasks varies (see benchmarks)

License

MIT (inherited from base model)

Citation

@misc{phi-4-reasoning-awq,
  author = {Ronan Takizawa},
  title = {Phi-4-reasoning AWQ 4-bit Quantized},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ronantakizawa/phi-4-reasoning-awq}}
}

Base Model Citation

Please refer to the original model card for the base model citation.

Acknowledgments

  • Microsoft for the Phi-4-reasoning model
  • MIT HAN Lab for the AWQ quantization method
  • Casper Hansen and the AutoAWQ team

Repository: github.com/ronantakizawa/phi4-reasoning-awq
