# wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This model is a 4-bit AWQ quantization of [yentinglin/Llama-3-Taiwan-8B-Instruct-128k](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct-128k).
## Quantize
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": []}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize the weights to 4 bits
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
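
If you want to sanity-check the quantized checkpoint before serving it, AutoAWQ can load it back directly. The sketch below is not part of the original card; it assumes the local `quant_path` directory produced above and a CUDA GPU, and the prompt is only illustrative.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'  # directory written by save_quantized above

# Load the 4-bit checkpoint; fuse_layers enables AutoAWQ's fused kernels
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Illustrative prompt, formatted with the Llama-3 chat template
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```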
## Inference with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,  # number of GPUs
    gpu_memory_utilization=0.9,
    dtype='half',
)

tokenizer = llm.get_tokenizer()

# Format the prompt with the Llama-3 chat template;
# add_generation_prompt starts the assistant turn so the model answers instead of continuing the user message
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
```
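
The same checkpoint can also be queried through vLLM's OpenAI-compatible API server. This is not covered in the original card; the sketch below assumes the server has been launched separately (command shown in the comment) and that the `openai` Python client is installed.

```python
# Assumed server launch (run separately in a shell):
#   python -m vllm.entrypoints.openai.api_server \
#       --model wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ \
#       --quantization awq --dtype half --tensor-parallel-size 2
from openai import OpenAI

# vLLM's server does not check the API key, but the client requires some value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ",
    messages=[{"role": "user", "content": "how tall is taipei 101"}],
    temperature=0.5,
    top_p=0.9,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```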