---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: yentinglin/Llama-3-Taiwan-8B-Instruct-128k
language:
- zh
- en
tags:
- zhtw
---

# wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This is a 4-bit AWQ quantization of [`yentinglin/Llama-3-Taiwan-8B-Instruct-128k`](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct-128k).

# Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# 4-bit weights, group size 128, GEMM kernels
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": [],
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

# Inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,   # number of GPUs
    gpu_memory_utilization=0.9,
    dtype='half',
)

tokenizer = llm.get_tokenizer()
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant turn marker so the model answers
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
```
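
# Inference with transformers

As an alternative to vLLM, the quantized checkpoint can also be loaded directly with 🤗 Transformers, which picks up the AWQ config stored in the model and uses the `autoawq` kernels. This is a minimal sketch, not verified against this particular checkpoint; it assumes `autoawq` is installed and a CUDA GPU is available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ'

# AWQ weights are loaded in 4 bit; activations run in fp16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{'role': 'user', 'content': "how tall is taipei 101"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
)

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```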