Update README.md

README.md
CHANGED

@@ -37,42 +37,41 @@ Gigi-Llama-3-8B-zh follows the Llama-3-8B-Instruct chat template; the pad token uses
You can load the model for inference with the code below. For more efficient inference we recommend vLLM (see the sketch after the examples). We will detail the model's performance soon, and will shortly release fine-tuned versions with larger parameter counts and better performance.

Before:

```python
import transformers
import torch

model_id = "yaojialzc/Gigi-Llama-3-8B-zh"

# Standard Llama-3-8B-Instruct text-generation pipeline in bf16
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "..."},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop at either the end-of-text token or the end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
```

llama 3 …
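Both versions build the prompt with `apply_chat_template`. For reference, here is a sketch of what the rendered prompt string looks like under the standard Llama 3 instruct template (an assumption; the authoritative template ships in the checkpoint's tokenizer config), with `add_generation_prompt=True` appending the trailing assistant header:

```python
# Approximate render of the Llama 3 chat template for a single user turn.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "明朝最后一位皇帝是谁?回答他的名字,然后停止输出<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```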
After:

```python
import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM

device = "cuda"

model_id = "yaojialzc/Gigi-Llama-3-8B-zh"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    # "Who was the last emperor of the Ming dynasty? Answer with his name, then stop."
    {"role": "user", "content": "明朝最后一位皇帝是谁?回答他的名字,然后停止输出"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# The template already contains <|begin_of_text|>, so skip extra special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.01,  # near-greedy decoding
    top_k=50,
    top_p=0.7,
    repetition_penalty=1.0,
    max_length=128,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep special tokens visible to inspect where generation stops
output = tokenizer.decode(output[0], skip_special_tokens=False)
print(output)
```
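Since the paragraph above recommends vLLM for more efficient inference, here is a minimal offline-inference sketch. It assumes vLLM is installed and reuses the `prompt` string built above; the sampling values mirror the example and are otherwise illustrative:

```python
# A sketch, not part of the original README: offline inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="yaojialzc/Gigi-Llama-3-8B-zh", dtype="bfloat16")
params = SamplingParams(temperature=0.01, top_p=0.7, max_tokens=128)

# vLLM takes raw prompt strings; reuse the chat-template prompt from above.
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```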

The Llama 3 model does not stop when it emits eot, so it cannot be used out of the box. For now we respect the official behavior: during fine-tuning we guide the model to emit end_of_text directly at the end of its reply, which makes it convenient, for the time being, to fine-tune on downstream domains out of the box.
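If a checkpoint instead ends its turns with `<|eot_id|>` (stock Llama 3 instruct behavior), the terminator trick from the removed example still applies; a sketch, assuming the `tokenizer`, `model`, and `input_ids` objects from the example above:

```python
# Stop at either <|end_of_text|> or <|eot_id|>, whichever appears first.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
output = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
)
```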