---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: yentinglin/Llama-3-Taiwan-8B-Instruct-128k
language:
- zh
- en
tags:
- zhtw
---

# wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This model is a 4-bit AWQ quantization of [`yentinglin/Llama-3-Taiwan-8B-Instruct-128k`](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct-128k), produced with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
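
If `autoawq` is installed, recent versions of `transformers` should also be able to load this checkpoint directly; a minimal sketch (`device_map='auto'` assumes at least one CUDA GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ'

# transformers reads the AWQ quantization_config stored in the checkpoint
# and dispatches to the autoawq kernels
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
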
# quantize

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# 4-bit weights, group size 128, zero-point quantization, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": []}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize (AutoAWQ calibrates on its default calibration dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
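
To sanity-check the export, the quantized folder can be reloaded with AutoAWQ; a minimal sketch, assuming a single CUDA GPU (the prompt and generation length are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# Reload the quantized weights; fuse_layers fuses attention/MLP modules
# for faster inference on supported GPUs
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Build a Llama-3 chat prompt and generate a short reply
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)
tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output_ids = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
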
# inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,       # set to the number of GPUs available
    gpu_memory_utilization=0.9,
    dtype='half',                 # AWQ kernels run in float16
)

tokenizer = llm.get_tokenizer()

# Build the Llama-3 chat prompt; add_generation_prompt appends the assistant
# header so the model answers instead of continuing the user turn
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

# Decode the generated token ids back to text
for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
```
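
Note: vLLM already exposes the decoded string as `output.outputs[0].text`, so the explicit `tokenizer.decode` call above is optional.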