---
license: llama2
---

# Sample repository

Development Status :: 2 - Pre-Alpha <br>

Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: [parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com).

[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fhuggingface.co%2Fdanielpark%2Fko-llama-2-jindo-7b-instruct-4bit-128g-gptq&count_bg=%23000000&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=views&edge_flat=false)](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq)

# danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq model card

- 4-bit, group size 128 GPTQ quantized weights of [danielpark/ko-llama-2-jindo-7b-instruct](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct).
- GPTQ is a state-of-the-art one-shot weight quantization method. This work builds on [GPTQ](https://github.com/IST-DASLab/gptq), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [GPTQ-triton](https://github.com/fpgaminer/GPTQ-triton), and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). A reproduction sketch follows below.
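
The checkpoint in this repository corresponds to a 4-bit, group size 128 GPTQ configuration. Below is a minimal sketch of how such a quantization could be reproduced with AutoGPTQ; the calibration text and output directory are placeholders, not the exact settings or data used for this model.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

BASE_MODEL = "danielpark/ko-llama-2-jindo-7b-instruct"
OUT_DIR = "ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"  # placeholder output path

# 4-bit weights with group size 128, matching this repository's naming.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(BASE_MODEL, quantize_config)

# A tiny placeholder calibration set; the real calibration data is not published here.
examples = [tokenizer("GPTQ is a one-shot weight quantization method.")]

model.quantize(examples)
model.save_quantized(OUT_DIR, use_safetensors=True)
tokenizer.save_pretrained(OUT_DIR)
```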

## Prompt Template

```
### System:
{System}

### User:
{User}

### Assistant:
{Assistant}
```
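
For illustration, a small helper (not part of this repository) that fills the template before generation might look like this:

```python
def build_prompt(system: str, user: str) -> str:
    # The model is expected to continue the text after "### Assistant:".
    return (
        f"### System:\n{system}\n\n"
        f"### User:\n{user}\n\n"
        f"### Assistant:\n"
    )


prompt = build_prompt(
    system="You are a helpful assistant.",
    user="GPTQ 양자화를 간단히 설명해 주세요.",  # "Briefly explain GPTQ quantization."
)
```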

# Inference

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lNDLSGR4_prc1QWYrbbhsgpYwYNkklzg?usp=sharing)

Install [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) to run generation:

```
$ pip install auto-gptq
```

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

# Configuration
MODEL_NAME_OR_PATH = "danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"
MODEL_BASENAME = "gptq_model-4bit-128g"
USE_TRITON = False

# Load the GPTQ-quantized model and its tokenizer.
MODEL = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME_OR_PATH,
    model_basename=MODEL_BASENAME,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=USE_TRITON,
    quantize_config=None,
)
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_fast=True)


def generate_text_with_model(prompt):
    # Generate directly with model.generate().
    prompt_template = f"{prompt}\n"
    input_ids = TOKENIZER(prompt_template, return_tensors="pt").input_ids.cuda()
    output = MODEL.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    generated_text = TOKENIZER.decode(output[0])
    return generated_text


def generate_text_with_pipeline(prompt):
    # Generate through the transformers text-generation pipeline.
    logging.set_verbosity(logging.CRITICAL)
    pipe = pipeline(
        "text-generation",
        model=MODEL,
        tokenizer=TOKENIZER,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15,
    )
    prompt_template = f"{prompt}\n"
    generated_text = pipe(prompt_template)[0]["generated_text"]
    return generated_text


# Example
prompt_text = "What is GPTQ?"
generated_text_model = generate_text_with_model(prompt_text)
print(generated_text_model)

generated_text_pipeline = generate_text_with_pipeline(prompt_text)
print(generated_text_pipeline)
```

## Web Demo

The web demos are implemented with several popular tools that make it easy to rapidly build web UIs.

| model | web ui | quantized |
| --- | --- | --- |
| danielpark/ko-llama-2-jindo-7b-instruct | using [gradio](https://github.com/dsdanielpark/gradio) on [colab](https://colab.research.google.com/drive/1zwR7rz6Ym53tofCGwZZU8y5K_t1r1qqo#scrollTo=p2xw_g80xMsD) | - |
| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using [text-generation-webui](https://github.com/oobabooga/text-generation-webui) on [colab](https://colab.research.google.com/drive/19ihYHsyg_5QFZ_A28uZNR_Z68E_09L4G) | gptq |
| danielpark/ko-llama-2-jindo-7b-instruct-ggml | [koboldcpp-v1.38](https://github.com/LostRuins/koboldcpp/releases/tag/v1.38) | ggml |
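
As a rough illustration of the gradio route in the table above, `generate_text_with_pipeline` from the inference example can be wrapped in a minimal UI as sketched below; the linked Colab demos may differ in detail.

```python
import gradio as gr

# Assumes generate_text_with_pipeline() from the inference example is already defined.
demo = gr.Interface(
    fn=generate_text_with_pipeline,
    inputs=gr.Textbox(lines=4, label="Prompt"),
    outputs=gr.Textbox(lines=12, label="Response"),
    title="ko-llama-2-jindo-7b-instruct (4-bit GPTQ)",
)
demo.launch(share=True)  # share=True exposes a temporary public URL (useful on Colab)
```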