---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: yentinglin/Llama-3-Taiwan-8B-Instruct-128k
language:
- zh
- en
tags:
- zhtw
---

# wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This model is a 4-bit AWQ quantization of [`yentinglin/Llama-3-Taiwan-8B-Instruct-128k`](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct-128k), produced with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
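
If `autoawq` is installed, recent versions of `transformers` should also be able to load this checkpoint directly; a minimal sketch (`device_map='auto'` assumes at least one CUDA GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ'

# transformers reads the AWQ quantization_config stored in the checkpoint
# and dispatches to the autoawq kernels
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
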
# quantize

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# 4-bit weights, group size 128, zero-point quantization, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": []}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize (AutoAWQ calibrates on its default calibration dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
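
To sanity-check the export, the quantized folder can be reloaded with AutoAWQ; a minimal sketch, assuming a single CUDA GPU (the prompt and generation length are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# Reload the quantized weights; fuse_layers fuses attention/MLP modules
# for faster inference on supported GPUs
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Build a Llama-3 chat prompt and generate a short reply
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)
tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output_ids = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
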
# inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,       # set to the number of GPUs available
    gpu_memory_utilization=0.9,
    dtype='half',                 # AWQ kernels run in float16
)

tokenizer = llm.get_tokenizer()

# Build the Llama-3 chat prompt; add_generation_prompt appends the assistant
# header so the model answers instead of continuing the user turn
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

# Decode the generated token ids back to text
for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
```
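
Note: vLLM already exposes the decoded string as `output.outputs[0].text`, so the explicit `tokenizer.decode` call above is optional.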