language:
- zh
- en
tags:
- llama2
LLama2-chat 7B Chinese Version
Introduction
Since the LLama2-chat model struggles to keep its responses in Chinese even when prompted with Chinese questions, this model aims to provide a LLama2-chat 7B model that can carry out question answering in Chinese.
The model uses LLama2-chat 7B as its base and is trained with LoRA, with the embedding and LM head included among the trainable parameters. The LoRA parameters have already been merged into the base model, so it can be used directly. Alternatively, you can manually merge ./sft_lora_model with Llama2-chat 7B to obtain the combined model.
The training data consists of 500,000 SFT samples drawn from the BELLE project.
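If you prefer to perform the merge yourself, a minimal sketch using the peft library might look like the following. The base-model id, the assumption that the extended tokenizer is saved alongside ./sft_lora_model, and the output path are all placeholders to adapt to your setup; the sketch also assumes the LoRA checkpoint carries the retrained embedding and LM head.

```python
# Minimal sketch of merging ./sft_lora_model into Llama2-chat 7B with peft.
# Model ids and paths below are assumptions; adjust them to your local setup.
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = LlamaTokenizer.from_pretrained("./sft_lora_model")  # extended Chinese tokenizer

# The LoRA checkpoint is assumed to include the resized embedding / LM head,
# so first resize the base model's embeddings to match the extended vocabulary.
base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, "./sft_lora_model")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```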
Training Details
Some details of the training:
- Training framework: the model was trained with a modified version of the Chinese-LLaMA-Alpaca project.
- Tokenizer: the model uses the tokenizer.model from the Chinese-Alpaca-Plus model. Since LLama2's tokenizer.model is identical to LLama1's, the tokenizer from the Chinese-LLaMA project can in theory be reused as-is without any token misalignment.
- Training parameters: because the embeddings have to be resized for the extended vocabulary, the newly added embeddings are randomly initialized (see the sketch after this list). As a result, DeepSpeed is very prone to reducing the loss scale early in training due to "OVERFLOW"; frequent reductions make the scale too small, cause overflow, and eventually crash the training. In this situation, do not lower the learning rate, warmup, or similar hyperparameters; instead, scale them up toward pretraining levels so that the randomly initialized embeddings get on track quickly.
- Training resources: 8×V100, 21 hours.
- Initial loss: 8.7072
- Final loss: 1.5674
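For reference, the embedding resize described above roughly corresponds to the sketch below; the model id and tokenizer path are placeholders, not the exact training code.

```python
# Rough sketch of the embedding resize discussed above; paths are placeholders.
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = LlamaTokenizer.from_pretrained("./chinese_alpaca_plus_tokenizer")

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))
# The rows added beyond old_vocab_size in the input embeddings and LM head are
# randomly initialized, which is why the loss scale is fragile early in training.
```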
Inference
The model still uses the Stanford Alpaca template, so don't forget to prepend the prompt template when testing. The template is:
"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\n${Your Content}\n\n### Response:\n\n"
For a dialogue with context, the template is:
"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\nHuman:${Previous Human Content}\nAssistant:${Previous Assistant Content}\nHuman:${Your Question}\n\n### Response:\n\n"
Licence
The model in this repository is open-sourced under the Apache-2.0 license; use of the model weights must additionally comply with the LLama2 MODEL LICENCE.
Future Work
The following models will be released soon:
- Models trained on a larger SFT data scale.
- Models trained from both LLama2 and LLama2-chat (13B and below, since I only have V100s), for comparison.