Breeze-7B-32k-Instruct-v1_0-AWQ
- Model creator: MediaTek Research
- Original model: Breeze-7B-32k-Instruct-v1_0
Description
This repo contains AWQ model files for MediaTek Research's Breeze-7B-32k-Instruct-v1_0.
About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings.
AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.
It is supported by:
- Text Generation Webui - using Loader: AutoAWQ
- vLLM - version 0.2.2 or later is required to support all model types.
- Hugging Face Text Generation Inference (TGI)
- Transformers version 4.35.0 and later, from any code or client that supports Transformers
- AutoAWQ - for use from Python code
Multi-user inference server: vLLM
Documentation on installing and using vLLM can be found here.
- Please ensure you are using vLLM version 0.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter.
For example:
```shell
python3 -m vllm.entrypoints.api_server \
    --model chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ \
    --quantization awq \
    --max-model-len 2048 \
    --dtype auto
```
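Once the server is up, you can send prompts to its `/generate` endpoint. The snippet below is a minimal sketch, assuming the server above is running locally on the default port 8000 and that the `requests` package is installed; the request and response shapes follow vLLM's demo `api_server`, which returns a JSON object with a `"text"` list.

```python
# Minimal sketch: query the running vLLM api_server over HTTP.
# Assumes the server started above is listening on http://localhost:8000.
import requests

# Same [INST] ... [/INST] instruct template used in the Python example below.
prompt = "[INST] 告訴我AI是什麼 [/INST]\n"

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": prompt, "max_tokens": 512, "temperature": 0.8, "top_p": 0.95},
)
print(response.json())  # {"text": ["<prompt followed by the generated answer>"]}
```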
- When using vLLM from Python code, again set `quantization=awq`.
For example:
```python
from vllm import LLM, SamplingParams

prompts = [
    "告訴我AI是什麼",
    "(291 - 150) 是多少?",
    "台灣最高的山是哪座?",
]
prompt_template = '''[INST] {prompt} [/INST]
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16384)

llm = LLM(model="chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ", quantization="awq", dtype="half", max_model_len=16384)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Inference from Python code using Transformers
Install the necessary packages
- Requires: Transformers 4.37.0 or later.
- Requires: AutoAWQ 0.1.8 or later.
```shell
pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
```
If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:
```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
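As a quick sanity check (not part of the original instructions), you can confirm that the installed versions meet the requirements above using only the Python standard library:

```python
# Sanity check: report the installed versions of autoawq and transformers.
from importlib.metadata import version

print("autoawq:", version("autoawq"))            # should be 0.1.8 or later
print("transformers:", version("transformers"))  # should be 4.37.0 or later
```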
Transformers example code (requires Transformers 4.37.0 and later)
```python
from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM

checkpoint = "chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ"

model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a pipeline for text generation.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Inference is also possible via Transformers' pipeline.
print("pipeline output: ", text_generation_pipeline.predict("請問台灣最高的山是?"))
```