Abbey4799/kw-cutegpt-13b-ift-lora

CuteGPT is an open-source conversational language model that supports both Chinese and English, developed by Fudan University KnowledgeWorks Laboratory. It is based on the original Llama model structure, and has a scale of 13B (13 billion) parameters. It can perform int8 precision inference on a single 3090 graphics card. CuteGPT expands the Chinese vocabulary and performs pre-training on the Llama model, improving its ability to understand Chinese. Subsequently, it is fine-tuned with conversational instructions to enhance the model's ability to understand instructions. Based on the KW-CuteGPT-7b version, KW-CuteGPT-13b has improved accuracy in knowledge, understanding of complex instructions, ability to comprehend long texts, reasoning ability, faithful question answering, and other capabilities. Currently, the KW-CuteGPT-13b version model outperforms the majority of models of similar scale in certain evaluation tasks.

from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel
import torch

The prompt template for inference

overall_instruction = "你是复旦大学知识工场实验室训练出来的语言模型CuteGPT。给定任务描述，请给出对应请求的回答。\n"
def generate_prompt(query, history, input=None):
    prompt = overall_instruction
    for i, (old_query, response) in enumerate(history):
        prompt += "Q: {}\nA: {}\n".format(old_query, response)
    prompt += "Q: {}\nA: ".format(query)
    return prompt

Load model, tokenizer, here we use lora version of checkpoint

w/o 8bit quantization

model_name = "XuYipei/kw-cutegpt-13b-base"
LORA_WEIGHTS = "Abbey4799/kw-cutegpt-13b-ift-lora"
tokenizer = LlamaTokenizer.from_pretrained(LORA_WEIGHTS)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()
model = PeftModel.from_pretrained(model, LORA_WEIGHTS).to(torch.float16)
device = torch.device("cuda")

w/ 8bit quantization (The performance of the model will experience some degradation after quantization.)

model_name = "XuYipei/kw-cutegpt-13b-base"
LORA_WEIGHTS = "Abbey4799/kw-cutegpt-13b-ift-lora"
tokenizer = LlamaTokenizer.from_pretrained(LORA_WEIGHTS)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)
device = torch.device("cuda")

Inference

history = []
queries = ['请推荐五本名著，依次列出作品名、作者','再来三本呢？']
memory_limit = 3 # the number of (query, response) to remember
for query in queries:
    prompt = generate_prompt(query, history)
    print(prompt)

    input_ids = tokenizer(prompt, return_tensors="pt", padding=False, truncation=False, add_special_tokens=False)
    input_ids = input_ids["input_ids"].to(device)

    with torch.no_grad():
        outputs=model.generate(
                input_ids=input_ids,
                top_p=0.8,
                top_k=50,
                repetition_penalty=1.1,
                max_new_tokens = 256,
                early_stopping = True,
                eos_token_id = tokenizer.convert_tokens_to_ids('<s>'),
                pad_token_id = tokenizer.eos_token_id,
                min_length = input_ids.shape[1] + 1
        )
    s = outputs[0][input_ids.shape[1]:]
    response=tokenizer.decode(s)
    response = response.replace('<s>', '').replace('<end>', '').replace('</s>', '')
    print(response)
    history.append((query, response))
    history = history[-memory_limit:]