---
license: llama2
---

# Sample repository

Development Status :: 2 - Pre-Alpha <br>

Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: [parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com).

[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fhuggingface.co%2Fdanielpark%2Fko-llama-2-jindo-7b-instruct-4bit-128g-gptq&count_bg=%23000000&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=views&edge_flat=false)](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq)

# danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq model card

- 4-bit, group size 128 GPTQ quantized weights of [danielpark/ko-llama-2-jindo-7b-instruct](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct).
- GPTQ is a state-of-the-art one-shot weight quantization method. This work builds on [GPTQ](https://github.com/IST-DASLab/gptq), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [GPTQ-triton](https://github.com/fpgaminer/GPTQ-triton), and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). A reproduction sketch follows below.
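
The checkpoint in this repository corresponds to a 4-bit, group size 128 GPTQ configuration. Below is a minimal sketch of how such a quantization could be reproduced with AutoGPTQ; the calibration text and output directory are placeholders, not the exact settings or data used for this model.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

BASE_MODEL = "danielpark/ko-llama-2-jindo-7b-instruct"
OUT_DIR = "ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"  # placeholder output path

# 4-bit weights with group size 128, matching this repository's naming.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(BASE_MODEL, quantize_config)

# A tiny placeholder calibration set; the real calibration data is not published here.
examples = [tokenizer("GPTQ is a one-shot weight quantization method.")]

model.quantize(examples)
model.save_quantized(OUT_DIR, use_safetensors=True)
tokenizer.save_pretrained(OUT_DIR)
```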

## Prompt Template

```
### System:
{System}

### User:
{User}

### Assistant:
{Assistant}
```
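
For illustration, a small helper (not part of this repository) that fills the template before generation might look like this:

```python
def build_prompt(system: str, user: str) -> str:
    # The model is expected to continue the text after "### Assistant:".
    return (
        f"### System:\n{system}\n\n"
        f"### User:\n{user}\n\n"
        f"### Assistant:\n"
    )


prompt = build_prompt(
    system="You are a helpful assistant.",
    user="GPTQ 양자화를 간단히 설명해 주세요.",  # "Briefly explain GPTQ quantization."
)
```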

# Inference

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lNDLSGR4_prc1QWYrbbhsgpYwYNkklzg?usp=sharing)

Install [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) to run generation:

```
$ pip install auto-gptq
```

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

# Configuration
MODEL_NAME_OR_PATH = "danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"
MODEL_BASENAME = "gptq_model-4bit-128g"
USE_TRITON = False

# Load the GPTQ-quantized model and its tokenizer.
MODEL = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME_OR_PATH,
    model_basename=MODEL_BASENAME,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=USE_TRITON,
    quantize_config=None,
)
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_fast=True)


def generate_text_with_model(prompt):
    # Generate directly with model.generate().
    prompt_template = f"{prompt}\n"
    input_ids = TOKENIZER(prompt_template, return_tensors="pt").input_ids.cuda()
    output = MODEL.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    generated_text = TOKENIZER.decode(output[0])
    return generated_text


def generate_text_with_pipeline(prompt):
    # Generate through the transformers text-generation pipeline.
    logging.set_verbosity(logging.CRITICAL)
    pipe = pipeline(
        "text-generation",
        model=MODEL,
        tokenizer=TOKENIZER,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15,
    )
    prompt_template = f"{prompt}\n"
    generated_text = pipe(prompt_template)[0]["generated_text"]
    return generated_text


# Example
prompt_text = "What is GPTQ?"
generated_text_model = generate_text_with_model(prompt_text)
print(generated_text_model)

generated_text_pipeline = generate_text_with_pipeline(prompt_text)
print(generated_text_pipeline)
```

## Web Demo

The web demos are implemented with several popular tools that make it easy to rapidly build web UIs.

| model | web ui | quantized |
| --- | --- | --- |
| danielpark/ko-llama-2-jindo-7b-instruct | using [gradio](https://github.com/dsdanielpark/gradio) on [colab](https://colab.research.google.com/drive/1zwR7rz6Ym53tofCGwZZU8y5K_t1r1qqo#scrollTo=p2xw_g80xMsD) | - |
| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using [text-generation-webui](https://github.com/oobabooga/text-generation-webui) on [colab](https://colab.research.google.com/drive/19ihYHsyg_5QFZ_A28uZNR_Z68E_09L4G) | gptq |
| danielpark/ko-llama-2-jindo-7b-instruct-ggml | [koboldcpp-v1.38](https://github.com/LostRuins/koboldcpp/releases/tag/v1.38) | ggml |
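
As a rough illustration of the gradio route in the table above, `generate_text_with_pipeline` from the inference example can be wrapped in a minimal UI as sketched below; the linked Colab demos may differ in detail.

```python
import gradio as gr

# Assumes generate_text_with_pipeline() from the inference example is already defined.
demo = gr.Interface(
    fn=generate_text_with_pipeline,
    inputs=gr.Textbox(lines=4, label="Prompt"),
    outputs=gr.Textbox(lines=12, label="Response"),
    title="ko-llama-2-jindo-7b-instruct (4-bit GPTQ)",
)
demo.launch(share=True)  # share=True exposes a temporary public URL (useful on Colab)
```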