---
license: other
---
|
### Introduction
|
Basically an update to our earlier attempt, [vicuna-chinese-replication-beta](https://huggingface.co/keyfan/vicuna-chinese-replication-beta).
|
* We adopted a curriculum-learning-like approach, starting from simple QAs and moving on to reasoning-intensive coding and mathematical problems. Coincidentally, [Ziya](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) adopted the same idea during its SFT period.
|
* The base model was changed from [chinese-llama](https://huggingface.co/ziqingyang/chinese-llama-lora-13b) to [chinese-llama-plus](https://huggingface.co/ziqingyang/chinese-llama-plus-lora-13b). However, as observed by [BiLLa](https://github.com/Neutralzz/BiLLa), continued training on a Chinese-only corpus significantly increases perplexity on English corpora, which in turn undermined abilities such as mathematical calculation in our preliminary experiments (a minimal perplexity check is sketched after this list). Continued pre-training remains under-studied; using a bilingual corpus appears to be the better alternative so far.
|
* We switched to the Vicuna v1.1 conversation template and included more CoT training data.
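
As a rough illustration of the perplexity check mentioned in the second bullet, the sketch below computes a model's perplexity on a short English sample with the `transformers` API. This is not the exact setup of our preliminary experiment; the checkpoint name and the sample text are placeholders for the models and English corpus you want to compare.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; repeat the measurement for each model you want to compare.
checkpoint = "keyfan/vicuna-chinese-replication-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

# Placeholder English text; use a proper held-out English corpus in practice.
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    # With labels provided, the model returns the mean token cross-entropy loss;
    # its exponential is the perplexity of the text under the model.
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```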
|
|
|
Again, this model is for research purposes only. There is no guarantee of its performance. All credit goes to the original authors of LLaMA and Chinese-LLaMA.
|
|
|
Compared with the previous release, the new model improves on coding and reasoning problems. However, it still suffers from hallucinations and performs poorly on Chinese domain-specific problems, e.g. Chinese literature and idioms.
|
|
|
### Usage
|
|
|
We use exactly the Vicuna template for training and inference. Sample code is given below; a multi-turn variant follows it.
|
|
|
```
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "keyfan/vicuna-chinese-replication-v1.1"

# Load the slow (SentencePiece) tokenizer and move the model to GPU.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

# Vicuna v1.1 conversation template: system prompt followed by USER/ASSISTANT turns.
template = ("A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's questions. "
            "USER: {}\nASSISTANT:")
question = template.format("Who was the president of the United States in 1955?")
inputs = tokenizer.encode(question, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, temperature=0.2, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
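
For multi-turn conversations, a rough sketch is shown below, reusing `tokenizer`, `model`, and `template` from the snippet above. It assumes the Vicuna v1.1 convention of closing each assistant reply with `</s>` before the next `USER:` turn; the first answer here is just an illustrative string. It also shows how to decode only the newly generated tokens so the prompt is not echoed back.

```
# Rough multi-turn sketch (assumes Vicuna v1.1's "</s>" separator between turns).
first_question = "Who was the president of the United States in 1955?"
first_answer = "Dwight D. Eisenhower was the president of the United States in 1955."  # illustrative reply
prompt = (template.format(first_question) + " " + first_answer + "</s>"
          + "USER: What about 1960?\nASSISTANT:")
inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, temperature=0.2, max_new_tokens=512)
# Slice off the prompt tokens and decode only the new ones.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```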
|
|
|
### Evaluation
|
|
|
* Results on the [Chinese-LLaMA-Alpaca devset](https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main/examples), compared with Alpaca-Plus-13B. For simplicity, we sample only one answer per question without any cherry-picking, using the template provided in their repo. GPT-4 has a strong bias toward more detailed answers, so the scores may not be consistent with human evaluation.
|
|
|
| Model | Macro-Average | QA | OQA | REASONING | LITERATURE | ENTERTAINMENT | GENERATION | TRANSLATION | CODE | ETHICS |
| - | - | - | - | - | - | - | - | - | - | - |
| Alpaca-Plus-13B | 77.3 | 70 | 74 | 70 | **80** | 77 | 82 | **89** | 64 | **90** |
| ours | **82.4** | **81** | **87** | **88** | 73 | **78** | **85** | 83 | **83** | 84 |
|
|
|
* Results on the newly released [C-Eval test set](https://cevalbenchmark.com/index.html#home) with 5-shot prompting. We slightly modified [MOSS's code](https://github.com/SJTU-LIT/ceval/blob/main/code/evaluator_series/evaluators/moss.py) from the C-Eval codebase by moving the '答案：' ("Answer:") suffix from the end of the question to the beginning of the chatbot response (a minimal sketch of this change follows the table below).
|
|
|
| Average | Avg(Hard) | STEM | Social Science | Humanities | Others |
| - | - | - | - | - | - |
| 37.0 | 29.5 | 34.6 | 44.5 | 35.7 | 35.9 |
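
A minimal sketch of that prompt tweak is shown below. The question text and options are illustrative placeholders, not the actual evaluator code; `template` is the Vicuna template from the Usage section.

```
# Illustrative sketch of the change described above (not the actual evaluator code).
question_block = "下列哪个选项正确？\nA. 选项一\nB. 选项二\nC. 选项三\nD. 选项四"

# Original ceval-style prompt: the "答案：" (Answer:) cue ends the user turn.
original_prompt = template.format(question_block + "\n答案：")

# Our modification: the cue starts the chatbot response instead.
modified_prompt = template.format(question_block) + " 答案："
```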
|