
CuteGPT is an open-source conversational language model that supports both Chinese and English, developed by the KnowledgeWorks Laboratory at Fudan University. It has 13B (13 billion) parameters and can run int8 inference on a single 3090 graphics card. The CuteGPT base model is pre-trained on a Chinese-English corpus and then fine-tuned on conversational instructions to improve its ability to follow instructions. Compared with KW-CuteGPT-7b, KW-CuteGPT-13b improves knowledge accuracy, understanding of complex instructions, long-text comprehension, reasoning, and faithful question answering. The KW-CuteGPT-13b model currently outperforms most models of similar scale on certain evaluation tasks.
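As a rough check on the int8 claim, the sketch below estimates the memory taken by the weights alone at float16 and at int8, and compares both with the 24 GB of a 3090. It also includes a hypothetical helper, `load_int8`, showing one way the checkpoint might be loaded in 8-bit. The helper is an assumption, not the authors' documented procedure: it relies on the `bitsandbytes` and `accelerate` packages being installed, and the estimate ignores activations and the KV cache.

```python
GIB = 1024 ** 3
N_PARAMS = 13e9        # approximate parameter count stated in this card
GPU_MEMORY_GIB = 24    # memory of a single RTX 3090


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights only (no activations, no KV cache)."""
    return n_params * bytes_per_param / GIB


fp16_gib = weight_memory_gib(N_PARAMS, 2)  # float16: 2 bytes per parameter
int8_gib = weight_memory_gib(N_PARAMS, 1)  # int8: 1 byte per parameter

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"int8 weights:    {int8_gib:.1f} GiB")
print("float16 fits on one 3090:", fp16_gib < GPU_MEMORY_GIB)
print("int8 fits on one 3090:   ", int8_gib < GPU_MEMORY_GIB)


def load_int8(model_path):
    """Hypothetical helper (assumption): load a LLaMA checkpoint in 8-bit.

    Imports are kept inside the function so the arithmetic above runs
    even when transformers is not installed.
    """
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    return LlamaForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # let accelerate place the layers on the GPU
    )
```

Under these assumptions the float16 weights alone slightly exceed 24 GiB, while the int8 weights need about half that, which is consistent with the card's statement that int8 inference fits on a single 3090.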

Note: Before using this model, request a license for LLaMA from the FAIR team at Meta AI.

from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

def generate_prompt(query, history):
    """Build the prompt: past (query, response) turns, each closed by <end>, then the new query."""
    prompt = ""
    for old_query, response in history:
        prompt += "{}{}\n<end>".format(old_query, response)
    prompt += "{}".format(query)
    return prompt



# Load model
device = torch.device("cuda:0")
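# Path to the authors' local checkpoint; replace it with the location of your own copy of the weights.
model_name = "/data/dell/xuyipei/my_llama/my_llama_13b/llama_13b_112_sft_v1"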
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16
)
model.eval()
model = model.to(device)


# Inference
history = []
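# Queries: "Please recommend five classic books, listing the title and author of each" and "Please give three more"
queries = ['请推荐五本名著,依次列出作品名、作者\n', '请再来三本\n']
memory_limit = 3  # number of most recent (query, response) pairs to keep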
for query in queries:
    # Build the prompt from the current query plus the remembered turns, then tokenize the prompt.
    prompt = generate_prompt(query, history)
    input_ids = tokenizer(prompt, return_tensors="pt", padding=False, truncation=False, add_special_tokens=False)
    input_ids = input_ids["input_ids"].to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            do_sample=True,  # top_p and top_k only take effect when sampling is enabled
            top_p=0.8,
            top_k=50,
            repetition_penalty=1.1,
            max_new_tokens=256,
            early_stopping=True,
            eos_token_id=tokenizer.convert_tokens_to_ids('<end>'),
            pad_token_id=tokenizer.eos_token_id,
            min_length=input_ids.shape[1] + 1
        )
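    # Keep only the newly generated tokens so the prompt is not echoed back into the history.
    s = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(s)
    response = response.replace('<s>', '').replace('<end>', '').replace('</s>', '')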
    print(response)
    history.append((query, response))
    history = history[-memory_limit:]
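
To make the prompt layout and the `memory_limit` window easier to see, the sketch below repeats `generate_prompt` from the example above and runs it on placeholder turns, without loading the model. The `Q…` and `A…` strings are made-up stand-ins, not real model output.

```python
def generate_prompt(query, history):
    """Concatenate past turns, each closed by <end>, followed by the new query."""
    prompt = ""
    for old_query, response in history:
        prompt += "{}{}\n<end>".format(old_query, response)
    prompt += "{}".format(query)
    return prompt


memory_limit = 3  # keep only the three most recent turns

# Placeholder conversation (assumption: illustrative strings only)
history = [("Q1\n", "A1"), ("Q2\n", "A2"), ("Q3\n", "A3"), ("Q4\n", "A4")]
history = history[-memory_limit:]  # the oldest turn (Q1, A1) is dropped

prompt = generate_prompt("Q5\n", history)
print(repr(prompt))
# prints 'Q2\nA2\n<end>Q3\nA3\n<end>Q4\nA4\n<end>Q5\n'
```

Each remembered turn contributes its query, its response, and a closing `<end>` marker, which is the same token the generation call uses as `eos_token_id`. Turns older than `memory_limit` never reach the model.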