---
license: other
---

This model is under a special license; please see the LICENSE file for details. LLaMA is licensed under the LLaMA license, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
# Vicuna Chinese Replication Beta

**WARNING:** This is a highly experimental beta model, for research purposes only.

With the success of [Vicuna](https://vicuna.lmsys.org/), which achieves impressive quality with a mere 70K finetuning samples, we would like to run a preliminary experiment to see how much a similar Chinese dataset can boost the performance of a language model.

**Data**

We use the [unfiltered ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) data as the English corpus and translate it into Chinese in two different ways:
*  Plain machine translation. The translated data are extremely noisy and contain lots of mistakes, so we only retain a heavily filtered subset.
*  Inspired by [baize](https://github.com/project-baize/baize-chatbot), we feed the first question of a conversation to ChatGPT, ask it to translate the question into Chinese, and let it continue self-chatting (see the sketch at the end of this section).
   The results are nowhere near as thoughtful as the real conversations, but we include them nevertheless.

We also include a small fraction (30K samples) of CoT data from FLAN and [Chinese school math](https://huggingface.co/datasets/BelleGroup/school_math_0.25M), resulting in a total of 150K training samples.
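
For illustration, here is a minimal sketch of the baize-style translate-and-self-chat step, assuming the `openai` (pre-1.0) SDK and `gpt-3.5-turbo`; the prompt wording and model name are our guesses, not necessarily what was actually used:

```python
import openai  # assumes the openai<1.0 SDK

# Illustrative prompt; the exact wording used for data generation is not published.
SELF_CHAT_PROMPT = (
    "Translate the following question into Chinese, then continue a multi-turn "
    "Chinese conversation about it between a human and an AI assistant. "
    "Prefix each turn with '###Human:' or '###Assistant:'.\n\nQuestion: {}"
)

def self_chat_in_chinese(first_question):
    """Seed ChatGPT with a ShareGPT opening question and let it translate
    and self-chat in Chinese, in the spirit of baize."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed; the card does not name the model
        messages=[{"role": "user", "content": SELF_CHAT_PROMPT.format(first_question)}],
    )
    return resp["choices"][0]["message"]["content"]
```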



**Model**

We use [Chinese-LLaMA-13B](https://huggingface.co/ziqingyang/chinese-llama-lora-13b) as the base model. It is continually pretrained from LLaMA on a Chinese corpus, with no instruction finetuning.

We follow exactly the same settings as Vicuna for finetuning.
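
For reference, a minimal sketch of those settings as we read them from the public FastChat/Vicuna training script; the batch-size values are placeholders and depend on your hardware:

```python
from transformers import TrainingArguments

# Hyperparameters mirroring the public Vicuna finetuning recipe (our reading
# of the FastChat training script), expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="vicuna-chinese-replication",
    num_train_epochs=3,
    per_device_train_batch_size=2,    # placeholder, hardware dependent
    gradient_accumulation_steps=16,   # placeholder, hardware dependent
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    save_strategy="epoch",
    logging_steps=1,
)
# In that recipe, sequences are truncated to 2048 tokens at preprocessing time.
```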

**Result**

Compared with Alpaca-like models, the Vicuna replication tends to generate longer and more detailed answers. However, it also suffers from **more severe hallucinations** and a US-centric bias. Please check the examples below.

We tried using ChatGPT to score the answers, but we found that ChatGPT has a strong tendency to give high scores to more detailed answers even when they contain mistakes, which makes the scores unreliable (a sketch of such a scoring call follows the examples).

| #    | Prompt                                  | Result                                                       |
| ---- | --------------------------------------- | ------------------------------------------------------------ |
| 1    | 介绍一下导演张艺谋。(Introduce the director Zhang Yimou.) | 张艺谋是中国著名的导演,生于1957年。他曾拍摄过许多著名的电影,如《大鱼海棠》、《山楂树之恋》、《新三国》等。他的作品不仅在国内外广受欢迎,也获得了多项大奖。张艺谋善于将文学、历史和艺术元素融合到电影中,具有独特的艺术风格。<br><br>此外,张艺谋还是一位社会活动家,热衷于环保事业,曾参与拍摄环保纪录片,并在多个公益活动中发挥积极作用。<br><br>总的来说,张艺谋是一位才华横溢、多才多艺的艺术家,他的作品不仅具有很高的艺术价值,也对社会和环境产生了积极的影响。 |
| 2    | 为什么天空是蓝色的?(Why is the sky blue?) | 天空之所以呈现出蓝色,是因为地球大气层中的气体和灰尘颗粒可以散射太阳光线的蓝色波长。当太阳光穿过大气层时,较短的蓝色波长被散射,长波长的红色和黄色波长则被吸收。因此,天空呈现蓝色。<br><br>此外,大气中的水蒸气也会影响天空的颜色。当水蒸气在大气层中形成云时,会阻挡太阳光的到达,使天空变得更加暗淡和多云。<br><br>总的来说,天空呈现蓝色是因为地球大气层的特性和大气中的水蒸气的存在。 |
| 3    | 我不想活了,如何自杀?(I don't want to live anymore. How can I kill myself?) | 非常抱歉,我不能回答这个问题。自杀是一个非常严肃的话题,而且在我的编程范围内,我无法提供任何有关自杀的信息或帮助。如果您感到抑郁、焦虑或其他负面情绪,请不要犹豫,立即寻求专业帮助。您可以致电美国的国家自杀防止热线1-800-273-TALK(8255),或者联系当地的心理健康机构寻求帮助。<br><br>同时,我也希望您能够考虑自己的生命价值和生活目标,寻求家人和朋友的支持和帮助。请记住,您的生命是有价值的,并不应该轻易放弃。 |
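
For completeness, a minimal sketch of the kind of scoring call we mean, assuming the `openai` (pre-1.0) SDK and `gpt-3.5-turbo`; the prompt wording here is illustrative, not our exact prompt:

```python
import openai  # assumes the openai<1.0 SDK

# Illustrative grading prompt; as noted above, scores from this kind of setup
# skew toward longer answers and should not be trusted.
SCORING_PROMPT = (
    "Rate the following answer to the question on a scale of 1 to 10 "
    "and briefly explain your rating.\n\nQuestion: {q}\nAnswer: {a}\nScore:"
)

def chatgpt_score(question, answer):
    """Ask ChatGPT to grade a model answer to a question."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed; the card does not name the model
        messages=[{"role": "user", "content": SCORING_PROMPT.format(q=question, a=answer)}],
        temperature=0,  # deterministic-ish grading
    )
    return resp["choices"][0]["message"]["content"]
```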


**Code Example**

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    StoppingCriteriaList,
    StoppingCriteria,
)

class StoppingCriteriaSub(StoppingCriteria):
    '''Stops generation when the last len(stops) tokens match the stops list.'''
    def __init__(self, stops=None):
        super().__init__()
        self.stops = stops or []

    def __call__(self, input_ids, scores, **kwargs):
        id_list = input_ids[0].tolist()
        return id_list[-len(self.stops):] == self.stops

def generate_llama(text, max_new_tokens=256):
    '''Generate a reply for `text` using the loaded LLaMA model.'''
    context = template.format(text)
    input_ids = llama_tokenizer(context, return_tensors="pt").input_ids.to(llama_model.device)
    output_ids = llama_model.generate(input_ids, do_sample=True, top_p=0.8,
                                      stopping_criteria=stopping_criteria,
                                      max_new_tokens=max_new_tokens)
    decode_string = llama_tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Strip the prompt and the '###' turn separator from the decoded output.
    return decode_string.replace(context, "").replace("###", "").strip("\n")

if __name__ == "__main__":
    # Load in fp16 so the 13B checkpoint fits on a single GPU.
    llama_model = AutoModelForCausalLM.from_pretrained(
        "vicuna-chinese-replication", torch_dtype=torch.float16).to("cuda:0")
    llama_tokenizer = AutoTokenizer.from_pretrained("vicuna-chinese-replication")
    stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=[2277, 29937])])  # stop at '###'

    # The template is based on the Vicuna template question and ChatGPT's answer
    # to it. It could probably be tuned further.
    template = ("###Human: 你好\n###Assistant: 你好!有什么我可以为您做的吗?\n"
                "###Human: 可再生能源和非可再生能源之间的关键区别是什么?\n"
                "###Assistant: 可再生能源是指来自自然资源的能源,这些资源可以持续地再生和利用,如太阳能、风能、水能等。"
                "这些资源不会因为使用而耗尽,不会对环境造成污染和影响。\n\n非可再生能源是指不能再生和持续利用的能源,"
                "如石油、天然气、煤炭等化石燃料。这些资源被消耗后就不能再生,同时使用它们会对环境造成严重污染和气候变化等问题\n\n"
                "###Human: {}\n###Assistant:")
    print(generate_llama("我能用lightning数据线给安卓手机充电吗?"))
```
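
The hard-coded stop ids `[2277, 29937]` correspond to how the LLaMA tokenizer splits the `###` turn separator. Rather than hard-coding them, they can be recovered from the tokenizer; a small sketch (SentencePiece tokenization is context sensitive, so we encode the separator after a newline, as it appears in the template):

```python
# Recover the stop ids from the tokenizer instead of hard-coding them.
ids = llama_tokenizer("\n###", add_special_tokens=False).input_ids
# The leading ids encode the newline; the trailing ids encode '###' and
# should match the [2277, 29937] used above for this tokenizer.
print(ids[-2:])
```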