Update README.md
README.md CHANGED
datasets:
- natural_instructions
- the_pile
- cot
- Muennighoff/P3
tags:
- gpt
pipeline_tag: text-generation
inference: true
widget:
- text: "Where is Zurich? Ans:"
- text: "What is the highest mountain? Answer:"

We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
The model is trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.

# Quick Start

```python
from transformers import pipeline

# Build a text-generation pipeline for the model
pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

pipe("Where is Zurich? Ans:")
```
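
Since the model is advertised for zero/few-shot inference, a few-shot prompt can be passed to the same pipeline. The sketch below is illustrative: the prompt format (the "... Ans:" style of the widget examples) and the generation arguments are assumptions, not settings documented for this model.

```python
from transformers import pipeline

pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

# A few in-context examples followed by the query; generation kwargs are
# forwarded to `generate` (greedy decoding and 16 new tokens are arbitrary choices).
few_shot_prompt = (
    "Where is Zurich? Ans: Zurich is in Switzerland.\n"
    "Where is Paris? Ans: Paris is in France.\n"
    "Where is Kyoto? Ans:"
)
out = pipe(few_shot_prompt, max_new_tokens=16, do_sample=False)
print(out[0]["generated_text"])
```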

# Training Data

We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile:
- [Natural-Instructions](https://github.com/allenai/natural-instructions)
- [P3](https://huggingface.co/datasets/Muennighoff/P3)
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- [the pile](https://huggingface.co/datasets/the_pile)

The Pile is used to preserve the general ability of GPT-J; the others are instruction-tuning datasets.
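
One way to realize such a mixture is to interleave the instruction data with Pile text at a fixed ratio. The sketch below uses toy in-memory stand-ins and an assumed 90/10 ratio; the actual corpora loading and mixing proportions are not specified here.

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the instruction corpora and the Pile; in practice these
# would be loaded from the sources listed above.
instruction_ds = Dataset.from_dict({"text": ["Where is Zurich? Ans: Zurich is in Switzerland."]})
pile_ds = Dataset.from_dict({"text": ["General text drawn from the Pile ..."]})

# Mix instruction-tuning examples with Pile text so the model keeps its
# general ability; the 90/10 split is an illustrative assumption.
mixed = interleave_datasets([instruction_ds, pile_ds], probabilities=[0.9, 0.1], seed=0)
```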

# Hyperparameters

We use AdamW with a learning rate of 1e-5 and a global batch size of 64, and train for 5k steps.
We use mixed-precision training, where activations are in FP16 while the optimizer states are kept in FP32.
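
The card does not specify the training framework, but in PyTorch this FP16-activation / FP32-optimizer-state split is commonly obtained with automatic mixed precision. A generic sketch with a hypothetical tiny model, not the actual decentralized 6B setup:

```python
import torch
from torch import nn

model = nn.Linear(2048, 2048).cuda()                        # hypothetical stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate from the card
scaler = torch.cuda.amp.GradScaler()

def training_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs with FP16 activations...
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # ...while parameters and AdamW optimizer states stay in FP32.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```
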
We truncate input sequences to 2048 tokens; input sequences shorter than 2048 tokens are concatenated into one long sequence to improve data efficiency.
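
A minimal sketch of this packing step, assuming the GPT-J tokenizer and an EOS token between documents (the separator choice and the handling of the trailing partial block are assumptions):

```python
from transformers import AutoTokenizer

MAX_LEN = 2048  # block size from the card
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

def pack_sequences(texts):
    """Concatenate tokenized documents, then cut into MAX_LEN-token blocks."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    # Keep only full blocks; the trailing partial block is dropped here.
    return [ids[i:i + MAX_LEN] for i in range(0, len(ids) - MAX_LEN + 1, MAX_LEN)]

blocks = pack_sequences(["Where is Zurich? Ans: Zurich is in Switzerland."] * 200)
```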

# Infrastructure

We used [the Together Research Computer](https://together.xyz/) to conduct training.
Specifically, we used 4 data-parallel workers, each with 2 A100 80GB GPUs.
The Together Research Computer connects clusters at Stanford University, ETH Zurich, the Open Science Grid, and the University of Wisconsin-Madison.