Update README.md

## Model Description

Llama-2-7B-32K-Instruct is an open-source, long-context chat model finetuned from [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K) over high-quality instruction and chat data.
We built Llama-2-7B-32K-Instruct with less than 200 lines of Python using the [Together API](https://together.ai/blog/api-announcement), and we also make the [recipe fully available](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).
We hope that this can enable everyone to finetune their own version of [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K) — play with [Together API](https://together.ai/blog/api-announcement) and give us feedback!
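
For a flavor of what such a script does, here is a minimal, hypothetical sketch of launching a fine-tune against the Together API; the endpoint and request fields below are illustrative assumptions, and the actual recipe lives in the linked repository.

```python
# Hypothetical sketch of kicking off a fine-tune through the Together API.
# The endpoint and request fields are illustrative assumptions; the real
# ~200-line recipe is in the linked GitHub repository.
import os

import requests

response = requests.post(
    "https://api.together.xyz/v1/fine-tunes",  # assumed fine-tuning endpoint
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "togethercomputer/Llama-2-7B-32K",  # the base model above
        "training_file": "file-abc-123",  # id of a previously uploaded dataset
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # job id and status, for polling until completion
```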

## Data Collection Details

Llama-2-7B-32K-Instruct is fine-tuned over a combination of two parts:

1. **19K single- and multi-round conversations generated by human instructions and [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) outputs**.
We collected the dataset following the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca — producing instructions by querying a powerful LLM (in this case, [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)).
The complete dataset is also released [here](https://huggingface.co/datasets/togethercomputer/llama-instruct).
We also share the complete recipe for the data collection process [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct); a minimal sketch of the querying loop follows this list.

2. **Long-context Summarization and Long-context QA**.
We follow the recipe of [Llama-2-7B-32K](https://together.ai/blog/Llama-2-7B-32K), and train our model with the [BookSum dataset](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections) and [Multi-document Question Answering](https://arxiv.org/abs/2307.03172).
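
To make the distillation step in part 1 concrete, below is a minimal sketch of the querying loop; `query_llama_2_70b_chat` is a hypothetical stand-in for whatever endpoint serves the teacher model, and the full pipeline is in the linked recipe repository.

```python
# Minimal sketch of the distillation loop from part 1: send human-written
# instructions to a strong teacher model and keep its answers as training
# targets. `query_llama_2_70b_chat` is a hypothetical helper standing in
# for whatever endpoint serves Llama-2-70B-Chat.
import json

def query_llama_2_70b_chat(instruction: str) -> str:
    raise NotImplementedError("call your Llama-2-70B-Chat deployment here")

human_instructions = [
    "Summarize the plot of Moby-Dick in three sentences.",
    "Explain what a 32K-token context window is useful for.",
]

with open("distilled_conversations.jsonl", "w") as f:
    for instruction in human_instructions:
        answer = query_llama_2_70b_chat(instruction)
        f.write(json.dumps({"instruction": instruction, "output": answer}) + "\n")
```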

[...]

```
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```

You can load the model directly from the Hugging Face model hub using

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is needed for this model's custom long-context code
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/Llama-2-7B-32K-Instruct", trust_remote_code=True
)
```
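
The elided example in the original card prompts the model for a poem about cats; continuing from the loading snippet above, here is a short generation sketch in that spirit, assuming the `[INST] ... [/INST]` prompt convention used by Llama-2 chat models (verify against the prompt format documented on the model card).

```python
# Generation sketch; the [INST] ... [/INST] prompt convention is an
# assumption borrowed from Llama-2 chat models; check the model card.
prompt = "[INST]\nWrite a poem about cats\n[/INST]\n\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```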

[...]

## Model Evaluation

We evaluate the model from three aspects: 1) [Alpaca Eval](https://tatsu-lab.github.io/alpaca_eval/); 2) [Rouge score over BookSum](https://together.ai/blog/Llama-2-7B-32K); and 3) [Accuracy over Multi-document Question Answering (MQA)](https://together.ai/blog/Llama-2-7B-32K). We compare with models including [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [Longchat-7b-16k](https://huggingface.co/lmsys/longchat-7b-16k) and [Longchat-7b-v1.5-32k](https://huggingface.co/lmsys/longchat-7b-v1.5-32k). We summarize the results below:

* Alpaca Eval

| Model | win_rate | standard_error | n_total | avg_length |
| -------- | ------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 71.37 | 1.59 | 805 | 1479 |
| Llama-2-7B-32K-Instruct | 70.36 | 1.61 | 803 | 1885 |
| oasst-rlhf-llama-33b | 66.52 | 1.66 | 805 | 1079 |
| text_davinci_003 | 50.00 | 0.00 | 805 | 307 |
| falcon-40b-instruct | 45.71 | 1.75 | 805 | 662 |
| alpaca-farm-ppo-human | 41.24 | 1.73 | 805 | 803 |
| alpaca-7b | 26.46 | 1.54 | 805 | 396 |
| text_davinci_001 | 15.17 | 1.24 | 804 | 296 |
* Rouge Score over BookSum

| Model | R1 | R2 | RL |
| -------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 0.055 | 0.008 | 0.046 |
| Longchat-7b-16k | 0.303 | 0.055 | 0.160 |
| Longchat-7b-v1.5-32k | 0.308 | 0.057 | 0.163 |
| Llama-2-7B-32K-Instruct (ours) | 0.336 | 0.076 | 0.184 |

* Accuracy over MQA

| Model | 20 docs (Avg 2.9K tokens) | 30 docs (Avg 4.4K tokens) | 50 docs (Avg 7.4K tokens) |
| -------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 0.384 | 0.375 | 0.313 |
| Longchat-7b-16k | 0.510 | 0.473 | 0.428 |
| Longchat-7b-v1.5-32k | 0.534 | 0.516 | 0.479 |
| Llama-2-7B-32K-Instruct (ours) | 0.622 | 0.604 | 0.589 |

We observe that our finetuned Llama-2-7B-32K-Instruct consistently outperforms the other baseline models, including Llama-2-7b-chat, Longchat-7b-16k and Longchat-7b-v1.5-32k.
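
For reference, a hedged sketch of how the two long-context numbers above can be reproduced in spirit: ROUGE via the Hugging Face `evaluate` package, and MQA accuracy as a simple does-the-gold-answer-appear check; the official scoring scripts behind the tables may differ.

```python
# Sketch of the two long-context metrics. The exact scoring behind the
# tables above may differ; see the linked blog post for the official setup.
import evaluate

# ROUGE over BookSum-style summaries (the R1 / R2 / RL columns)
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the model's generated summary"],
    references=["the reference summary from BookSum"],
)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])

# MQA accuracy as a containment check (an assumption, not the official metric)
def mqa_accuracy(outputs: list[str], gold_answers: list[str]) -> float:
    hits = sum(g.lower() in o.lower() for o, g in zip(outputs, gold_answers))
    return hits / len(gold_answers)
```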
## Limitations and Bias

[...]