|
--- |
|
license: gpl-3.0 |
|
datasets: |
|
- JosephusCheung/GuanacoDataset |
|
- yahma/alpaca-cleaned |
|
language: |
|
- en |
|
- zh |
|
- ja |
|
tags: |
|
- llama |
|
- guanaco |
|
- alpaca |
|
- lora |
|
- finetune |
|
--- |
|
|
|
# Guanaco-leh-V2: A Multilingual Instruction-Following Language Model Based on LLaMA 7B |
|
This model was trained with [guanaco-lora](https://github.com/KohakuBlueleaf/guanaco-lora), with LoRA applied to the attention layers and the embed_tokens and lm_head layers trained as well.
|
|
|
The dataset is from [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) and [guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset). |
|
With the trained embedding and head, the model performs better at Chinese and Japanese than the original LLaMA, and with instruction-based prompts it is easier to use.
|
|
|
Since this model is trained on the Guanaco dataset, you can also use it as a chatbot. Just use this format:
|
``` |
|
### Instruction: |
|
User: <Message history> |
|
Assistant: <Message history> |
|
|
|
### Input: |
|
System: <System response for next message, optional> |
|
User: <Next message> |
|
|
|
### Response: |
|
``` |
|
|
|
**Tip: I removed the first line of the original prompt template to reduce token consumption, so please consider removing it as well when you use this model.**
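If it helps, here is a minimal Python sketch of assembling a prompt in this format. The `build_prompt` helper and its message structure are my own illustration, not part of the guanaco-lora repo.

```python
# A minimal sketch of building the chat prompt shown above.
# `build_prompt` and its arguments are illustrative, not from the repo.
def build_prompt(history, user_message, system=None):
    """history: list of (user_text, assistant_text) pairs."""
    lines = ["### Instruction:"]
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    lines += ["", "### Input:"]
    if system is not None:
        lines.append(f"System: {system}")
    lines.append(f"User: {user_message}")
    lines += ["", "### Response:"]
    return "\n".join(lines)


prompt = build_prompt(
    history=[("Hello!", "Hi! How can I help you?")],
    user_message="Please introduce yourself.",
    system="You are a helpful multilingual assistant.",
)
```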
|
|
|
## Differences from the previous model
|
The main differences are: |
|
* the model is trained in bf16 instead of 8-bit

* the context cutoff length is increased to 1024

* a larger dataset is used (latest Guanaco + alpaca-cleaned = 540k entries)

* a larger batch size is used (64 -> 128)
|
|
|
Since the training data contains more chat-based data, this model is a better fit for chatbot usage.
|
|
|
|
|
## Try this model: |
|
You can try this model with this [colab](https://colab.research.google.com/drive/1nn6TCAKyFrgDEgA6X3o3YbxfbMm8Skp4). |
|
Or use generate.py in [guanaco-lora](https://github.com/KohakuBlueleaf/guanaco-lora); all of the examples below were generated with guanaco-lora.
|
|
|
If you want to use the LoRA model from guanaco-7b-leh-v2-adapter/, remember to turn off load_in_8bit, or manually merge it into the 7B model!
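As a rough sketch, the adapter can be loaded (and optionally merged) with the PEFT library like this; the base-model path is a placeholder and the exact setup is not confirmed by the repo:

```python
# Rough sketch: load the LoRA adapter with PEFT, without 8-bit loading.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b",          # placeholder: your LLaMA 7B weights
    torch_dtype=torch.bfloat16,  # do NOT pass load_in_8bit=True here
)
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")

model = PeftModel.from_pretrained(base, "guanaco-7b-leh-v2-adapter")
# Optionally fold the LoRA weights into the base model:
model = model.merge_and_unload()
```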
|
|
|
### Recommended generation parameters:
|
* temperature: 0.5~0.7 |
|
* top p: 0.65~1.0 |
|
* top k: 30~50 |
|
* repetition penalty: 1.03~1.17
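As a sketch, these recommendations map onto the usual transformers generation arguments like this (continuing from the loading example above; the specific values are just one choice inside each range):

```python
# Sampling with the recommended parameter ranges.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,         # 0.5 ~ 0.7
    top_p=0.9,               # 0.65 ~ 1.0
    top_k=40,                # 30 ~ 50
    repetition_penalty=1.1,  # 1.03 ~ 1.17
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```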
|
|
|
|
|
## Training Setup |
|
* 2x RTX 3090 with model parallelism
|
* batch size = bsz 8 * grad acc 16 = 128 |
|
* ctx cut off length = 1024 |
|
* only train on output (with loss mask) |
|
* group-by-length enabled
|
* 538k entries, 2 epochs (about 8400 steps)
|
* lr 2e-4 |
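To illustrate the "only train on output (with loss mask)" item above, here is a minimal sketch of how such masking is usually done with transformers; this is my own illustration of the idea, not the exact code from guanaco-lora:

```python
# Sketch: prompt tokens get label -100 so the cross-entropy loss ignores them
# and only the response tokens contribute to training.
CUTOFF_LEN = 1024  # the context cutoff length above

def tokenize_example(prompt, response, tokenizer):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:CUTOFF_LEN]
    labels = ([-100] * len(prompt_ids) + response_ids)[:CUTOFF_LEN]
    return {"input_ids": input_ids, "labels": labels}
```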
|
|
|
|
|
## Some Examples
|
(As you can see, although Guanaco can reply fluently, the content is quite confusing, so you may want to add something in the System part.)
|
![](https://i.imgur.com/Hxyf3tR.png) |
|
![](https://i.imgur.com/Mu06jxn.png) |
|
|
|
I used Guanaco with an instruction prompt to translate a Chinese article into Japanese, German, and English.

Then I used GPT-4 to score the translations:
|
![](https://i.imgur.com/NfFQbZ2.png) |
|
|
|
## Some more information |
|
|
|
### Why use lora+embed+head |
|
First, I think it is obvious that when an LLM isn't good at a certain language and you want to fine-tune it for that language, you should train the embedding and head parts.<br>

But the question is: "Why not just do a native (full) finetune?"<br>

If you have looked into Alpaca models or their training, you may have noticed that a lot of them share one problem: "memorization".<br>

The loss drops sharply at the beginning of every epoch, like some kind of "overfitting".<br>

In my opinion, this is because the number of parameters in LLaMA is so large that it simply memorizes all the training data.
|
|
|
But if I only apply LoRA to the attention part (ignoring the MLP part), the parameter count is not large enough to memorize the training data, so the model is much less likely to memorize everything.
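For reference, here is a sketch of this "LoRA on attention + fully trained embed/head" setup with PEFT; the target module names follow the usual LLaMA attention projections, and the hyperparameters are illustrative rather than the exact ones used for this model:

```python
# Sketch: LoRA only on the attention projections, while embed_tokens and
# lm_head are trained normally via modules_to_save.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only, MLP skipped
    modules_to_save=["embed_tokens", "lm_head"],              # fully trained, not LoRA
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # `base` = the LLaMA 7B model loaded earlier
model.print_trainable_parameters()
```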