license: mit
language:
- en
tags:
- AWQ
- phi3
Phi 3 mini 4k instruct - AWQ
- Model creator: Microsoft
- Original model: Phi 3 mini 4k Instruct
Description
This repo contains AWQ model files for Microsoft's Phi 3 mini 4k Instruct.
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
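To give a feel for what "low-bit weight quantization" means, here is a minimal sketch of plain min/max 4-bit quantization of one group of weights. This is not AWQ itself: AWQ additionally rescales weights using activation statistics before quantizing, which is omitted here. The function names and sample values are hypothetical and chosen for illustration only.

```python
def quantize_group(weights, bits=4):
    """Uniform min/max quantization of one weight group (illustrative, not AWQ)."""
    qmax = (1 << bits) - 1                      # 15 distinct steps above zero for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0             # fall back to 1.0 if the group is constant
    # Map each weight to an integer code in [0, qmax].
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weight values, for illustration only.
group = [-0.30, -0.10, 0.00, 0.05, 0.20, 0.45]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print("max reconstruction error:", max_err)
```

Each weight is stored as a 4-bit code plus a shared scale and offset per group, so the reconstruction error is bounded by about half a quantization step.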
It is also supported by the continuous-batching server vLLM, which allows AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models. However, AWQ makes it possible to use much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
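The GPU sizing in the 70B example can be checked with a rough back-of-the-envelope calculation. The sketch below counts weight storage only and assumes 16-bit weights for the unquantised model. Real usage is higher because of the KV cache, activations, and the per-group scales that AWQ stores alongside the 4-bit weights. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 70B model: 16-bit weights versus 4-bit weights.
fp16_70b = weight_memory_gb(70e9, 16)
awq_70b = weight_memory_gb(70e9, 4)
print(f"70B  fp16: {fp16_70b:.1f} GB, 4-bit: {awq_70b:.1f} GB")

# Phi-3 mini (3.8B parameters), same assumptions.
fp16_phi3 = weight_memory_gb(3.8e9, 16)
awq_phi3 = weight_memory_gb(3.8e9, 4)
print(f"3.8B fp16: {fp16_phi3:.1f} GB, 4-bit: {awq_phi3:.1f} GB")
```

Under these assumptions the 70B weights shrink from about 140 GB to about 35 GB, which is why a single 48GB card can be enough where two 80GB cards were needed before.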
Model Details
Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters. It was trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each can support.
The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.
Resources and Technical Documentation:
- Phi-3 Microsoft Blog
- Phi-3 Technical Report
- Phi-3 on Azure AI Studio
- Phi-3 GGUF: 4K
- Phi-3 ONNX: 4K
Prompt Format
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
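A small helper can assemble this format so the special tokens are not retyped by hand. This is a minimal sketch for a single user turn only; the function name `build_phi3_prompt` is hypothetical, and system messages or multi-turn history are not covered by this card.

```python
def build_phi3_prompt(user_message: str) -> str:
    """Wrap one user message in the single-turn prompt format shown above."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"


prompt = build_phi3_prompt("How to explain the Internet for a medieval knight?")
print(prompt)
```

The returned string ends right after the assistant tag, so the model's generation continues as the assistant's reply.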
How to use
Using vLLM
from vllm import LLM, SamplingParams
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128)
# Create an LLM.
llm = LLM(model="Sreenington/Phi-3-mini-4k-instruct-AWQ", quantization="awq")
# Prompt template
prompt = """
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
"""
outputs = llm.generate(prompt, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n Generated text:\n {generated_text!r}")