metadata

license: llama2

Sample repository

Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: parkminwoo1991@gmail.com.

danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq model card

Weights of danielpark/ko-llama-2-jindo-7b-instruct, quantized to 4 bits with a group size of 128.
GPTQ is a state-of-the-art one-shot weight quantization method. This code builds on GPTQ, GPTQ-for-LLaMa, GPTQ-triton, and AutoGPTQ.
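
To make "4-bit" and "group size 128" concrete, here is a minimal pure-Python sketch of group-wise quantization. It uses plain round-to-nearest, which is a deliberate simplification and not the GPTQ algorithm itself (GPTQ also compensates for quantization error using second-order information). All function names and the sample data are hypothetical.

```python
# Illustrative only: round-to-nearest group quantization, NOT the GPTQ algorithm.
GROUP_SIZE = 128        # weights sharing one scale and offset ("128g")
BITS = 4                # bits per quantized weight
LEVELS = 2 ** BITS - 1  # highest integer code (15)


def quantize_group(weights):
    """Map one group of floats to integer codes in [0, LEVELS] plus (scale, minimum)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Recover approximate floats from the integer codes."""
    return [c * scale + w_min for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a row of weights into groups and quantize each group separately."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


# Made-up weights in [-0.5, 0.5], two groups of 128
row = [((i * 37) % 101 - 50) / 100 for i in range(256)]
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_error)
```

Each group stores its own scale and minimum, so a smaller group size tracks local weight ranges more closely at the cost of more per-group metadata.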

Prompt Template

### System:
{System}

### User:
{User}

### Assistant:
{Assistant}
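
The template above can be filled in programmatically. The following is a small sketch, assuming the blank-line layout shown above is what the model expects; `build_prompt` and the example system message are hypothetical and not part of the repository.

```python
# Template text copied from the model card; placeholders are filled with str.format.
PROMPT_TEMPLATE = (
    "### System:\n{System}\n\n"
    "### User:\n{User}\n\n"
    "### Assistant:\n{Assistant}"
)


def build_prompt(user, system="", assistant=""):
    """Fill the template. Leave `assistant` empty so the model writes the reply."""
    return PROMPT_TEMPLATE.format(System=system, User=user, Assistant=assistant)


# The system message here is an assumption chosen for illustration.
prompt = build_prompt("What is GPTQ?", system="You are a helpful assistant.")
print(prompt)
```

Leaving the assistant slot empty makes the prompt end right after the "### Assistant:" line, which is where generation continues.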

Inference

Install AutoGPTQ to run inference.

$ pip install auto-gptq

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

# Set config
MODEL_NAME_OR_PATH = "danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"
MODEL_BASENAME = "gptq_model-4bit-128g"
USE_TRITON = False

# Load the quantized model onto the first GPU
MODEL = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME_OR_PATH,
    model_basename=MODEL_BASENAME,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=USE_TRITON,
    quantize_config=None
)

# Load the matching tokenizer
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_fast=True)


def generate_text_with_model(prompt):
    prompt_template = f"{prompt}\n"
    input_ids = TOKENIZER(prompt_template, return_tensors='pt').input_ids.cuda()
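    # do_sample=True is needed for temperature to take effect; otherwise decoding is greedy
    output = MODEL.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)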
    generated_text = TOKENIZER.decode(output[0])
    return generated_text

def generate_text_with_pipeline(prompt):
    logging.set_verbosity(logging.CRITICAL)
    pipe = pipeline(
        "text-generation",
        model=MODEL,
        tokenizer=TOKENIZER,
        max_new_tokens=512,
        do_sample=True,  # needed for temperature and top_p to take effect
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )
    prompt_template = f"{prompt}\n"
    generated_text = pipe(prompt_template)[0]['generated_text']
    return generated_text

# Example
prompt_text = "What is GPTQ?"
generated_text_model = generate_text_with_model(prompt_text)
print(generated_text_model)

generated_text_pipeline = generate_text_with_pipeline(prompt_text)
print(generated_text_pipeline)
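
Both functions return the prompt together with the completion. When the prompt follows the template from the Prompt Template section, the reply can be separated with plain string handling. This is a sketch under assumptions: `extract_reply` is a hypothetical helper, and `</s>` is assumed to be the end-of-sequence token, as in the Llama 2 tokenizer.

```python
ASSISTANT_MARKER = "### Assistant:"


def extract_reply(generated_text, eos_token="</s>"):
    """Return the text after the last assistant marker, without the EOS token."""
    # rpartition yields the whole string as the tail when the marker is absent
    reply = generated_text.rpartition(ASSISTANT_MARKER)[2]
    return reply.replace(eos_token, "").strip()


# Made-up model output for illustration
sample_output = "<s> ### User:\nWhat is GPTQ?\n\n### Assistant:\nA quantization method.</s>"
print(extract_reply(sample_output))
```

If the marker is missing, the helper returns the full text with only the EOS token and surrounding whitespace removed.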

Web Demo

The web demos are built with several popular tools for quickly creating web UIs.

Model	Web UI	Quantization
danielpark/ko-llama-2-jindo-7b-instruct	gradio on Colab	-
danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq	text-generation-webui on Colab	GPTQ
danielpark/ko-llama-2-jindo-7b-instruct-ggml	koboldcpp-v1.38	GGML