---
license: other
---

## Introduction

Basically an update to the old attempt of vicuna-chinese-replication-beta.

- We adopted a curriculum-learning-like approach, starting from simple QAs and moving to reasoning-intensive coding and mathematical problems. Coincidentally, Ziya adopted the same idea during its SFT period.
- The base model was changed from chinese-llama to chinese-llama-plus. However, as observed by BiLLa, continued training on a Chinese-only corpus significantly increases perplexity on English corpora, which in turn undermines abilities in fields like mathematical calculation in our preliminary experiments. Continued pre-training is still under-studied; based on the results so far, using a bilingual corpus may be a better alternative.
- We switched to the Vicuna v1.1 conversation template and included more CoT training data; a sketch of the multi-turn template format follows this list.
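
The single-turn form of this template appears in the Usage section below. The helper here is a minimal sketch of how a multi-turn prompt could be assembled under the Vicuna v1.1 convention; the function name, the placement of the `</s>` separator, and the exact spacing are assumptions rather than code from this repository.

```python
# Minimal sketch of a multi-turn Vicuna v1.1 prompt (hypothetical helper, not part
# of this repo). System prompt, then alternating USER/ASSISTANT turns; completed
# assistant turns are assumed to be terminated with the </s> end-of-sequence token.
SYSTEM = ("A chat between a curious human and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the human's questions. ")

def build_conversation_prompt(history, new_question):
    """history: list of (user, assistant) pairs; new_question: the current user turn."""
    prompt = SYSTEM
    for user, assistant in history:
        prompt += f"USER: {user}\nASSISTANT: {assistant}</s>"
    prompt += f"USER: {new_question}\nASSISTANT:"
    return prompt
```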

Again, this is for research purposes only. There is no guarantee of its performance. All credits to the original authors of LLaMA and Chinese-LLaMA.

Compared with the previous release, the new model improves on coding and reasoning problems. However, it still suffers from hallucinations and performs poorly on Chinese domain-specific problems, e.g. Chinese literature and idioms.

## Usage

We use exactly the Vicuna template for training and inference. Sample code is shown below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "keyfan/vicuna-chinese-replication-v1.1"

# Use the slow (SentencePiece) tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

# Single-turn Vicuna v1.1 prompt: system message followed by one USER/ASSISTANT turn.
template = ("A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's questions. "
            "USER: {}\nASSISTANT:")
question = template.format("Who was the president of the United States in 1955?")
inputs = tokenizer.encode(question, return_tensors="pt").cuda()
# Sample one answer with low temperature.
outputs = model.generate(inputs, do_sample=True, temperature=0.2, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```

## Evaluation

- Result on the Chinese-LLaMA-Alpaca devset compared with the result of Alpaca-Plus-13B. For simplicity, we sampled only one answer per question without any cherry-picking (a sketch of this sampling loop follows the table below). We used the template as provided in their repo. GPT-4 has a strong bias toward more detailed answers, so the score may not be consistent with human evaluation.
| Model | Macro-Average | QA | OQA | REASONING | LITERATURE | ENTERTAINMENT | GENERATION | TRANSLATION | CODE | ETHICS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alpaca-Plus-13B | 77.3 | 70 | 74 | 70 | 80 | 77 | 82 | 89 | 64 | 90 |
| ours | 82.4 | 81 | 87 | 88 | 73 | 78 | 85 | 83 | 83 | 84 |
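
For illustration, the loop below sketches how one answer per devset question can be sampled with the inference code from the Usage section. The file name and JSON field are placeholders, not the actual layout of the Chinese-LLaMA-Alpaca devset.

```python
import json

# Placeholder file/field names; the actual devset layout may differ.
with open("devset.json", encoding="utf-8") as f:
    devset = json.load(f)

answers = []
for item in devset:
    prompt = template.format(item["question"])  # `template` from the Usage section above
    inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
    # Exactly one sampled answer per question, same decoding settings as the Usage example.
    outputs = model.generate(inputs, do_sample=True, temperature=0.2, max_new_tokens=512)
    answers.append(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```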
- Result on the newly released C-Eval test set with 5-shot prompting. We slightly modified MOSS's code from the ceval codebase by moving the '答案:' (Answer:) suffix from the end of the question to the beginning of the chatbot response; a sketch of the resulting prompt layout follows the table below.
| Average | Avg(Hard) | STEM | Social Science | Humanities | Others |
| --- | --- | --- | --- | --- | --- |
| 37.0 | 29.5 | 34.6 | 44.5 | 35.7 | 35.9 |
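
The snippet below sketches the prompt change described above: each few-shot exemplar keeps its answer after '答案:', while for the target question the '答案:' marker is moved to the start of the assistant turn. Function and variable names, choice formatting, and wording are illustrative assumptions, not the actual ceval/MOSS code.

```python
# Sketch of the modified C-Eval 5-shot prompt (illustrative only). `template` is the
# Vicuna prompt template from the Usage section above.
def build_ceval_prompt(few_shot_examples, question, choices):
    """few_shot_examples: list of (question, choices, answer_letter) tuples."""
    parts = []
    for q, ch, ans in few_shot_examples:
        opts = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", ch))
        parts.append(f"{q}\n{opts}\n答案:{ans}")
    opts = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    parts.append(f"{question}\n{opts}")
    # '答案:' now begins the assistant response instead of ending the question.
    return template.format("\n\n".join(parts)) + " 答案:"
```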