metadata
license: mit
language:
  - en
tags:
  - AWQ
  - phi3

Phi 3 mini 4k instruct - AWQ

Description

This repo contains AWQ model files for Microsoft's Phi 3 mini 4k Instruct.

About AWQ

AWQ (Activation-aware Weight Quantization) is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is also now supported by the continuous batching server vLLM, which allows AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models. However, AWQ makes it possible to use much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
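The GPU sizing in the 70B example follows from simple arithmetic on weight storage. The sketch below is a rough estimate only: it counts the weights alone and ignores quantization scales and zero points, the KV cache, and activations, so real usage will be somewhat higher. The helper name `weight_memory_gb` is hypothetical and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare 16-bit weights with 4-bit quantized weights.
for name, params in [("70B model", 70e9), ("Phi-3 mini (3.8B)", 3.8e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions, a 70B model needs about 140 GB for 16-bit weights (more than one 80GB GPU can hold) and about 35 GB at 4 bits, which fits on a single 48GB GPU with room left for the cache.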

Model Details

Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters. It was trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each can support.

The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, Phi-3 Mini-4K-Instruct showed robust, state-of-the-art performance among models with fewer than 13 billion parameters.

Resources and Technical Documentation:

Prompt Format

<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
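As a minimal sketch, the format above can be produced from a plain user message with a small helper. The function name `build_phi3_prompt` is hypothetical, and the trailing newline after `<|assistant|>` is an assumption based on the layout shown above.

```python
def build_phi3_prompt(user_message: str) -> str:
    """Wrap a single user message in the prompt format shown above."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"


prompt = build_phi3_prompt("How to explain the Internet for a medieval knight?")
print(prompt)
```

Building the string in one place keeps the special tokens consistent across prompts and avoids stray whitespace before `<|user|>`.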

How to use

Using vLLM

from vllm import LLM, SamplingParams

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128)

# Create an LLM.
llm = LLM(model="Sreenington/Phi-3-mini-4k-instruct-AWQ", quantization="awq")

# Prompt template (the trailing backslash avoids a leading newline before <|user|>)
prompt = """\
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
"""

outputs = llm.generate(prompt, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n Generated text:\n {generated_text!r}")