juewang committed on
Commit ce5e160
1 Parent(s): 99acfd8

Update README.md

  1. README.md +27 -2
README.md CHANGED
@@ -5,8 +5,11 @@ datasets:
  - natural_instructions
  - the_pile
  - cot
  tags:
  - gpt
  widget:
  - text: "Where is Zurich? Ans:"
  - text: "What is the highest mountain? Answer:"
@@ -16,7 +19,6 @@ widget:
  We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
  The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
- We fine-tune GPT-J-6B on NI, P3, COT, the pile data.

  # Quick Start

@@ -26,4 +28,27 @@ from transformers import pipeline
  pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

  pipe("Where is Zurich? Ans:")
- ```

@@ -5,8 +5,11 @@ datasets:
  - natural_instructions
  - the_pile
  - cot
+ - Muennighoff/P3
  tags:
  - gpt
+ pipeline_tag: text-generation
+ inference: true
  widget:
  - text: "Where is Zurich? Ans:"
  - text: "What is the highest mountain? Answer:"
 
@@ -16,7 +19,6 @@ widget:
  We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
  The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.

  # Quick Start
 
@@ -26,4 +28,27 @@ from transformers import pipeline
  pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

  pipe("Where is Zurich? Ans:")
+ ```
+
+ # Training Data
+
+ We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile data:
+ - [Natural-Instructions](https://github.com/allenai/natural-instructions)
+ - [P3](https://huggingface.co/datasets/Muennighoff/P3)
+ - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
+ - [the Pile](https://huggingface.co/datasets/the_pile)
+
+ The Pile is used to preserve the general abilities of GPT-J.
+ The others are instruction-tuning datasets.
+
+ # Hyperparameters
+
+ We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
+ We used mixed-precision training, where the activations are in FP16 while the optimizer states are kept in FP32.
+ We truncated each input sequence to 2048 tokens, and concatenated inputs shorter than 2048 tokens into one long sequence to improve data efficiency.
+
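The truncate-and-concatenate step described under Hyperparameters can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the function name `pack_sequences`, the separator token id `sep_id`, and the greedy "never split a sequence across two packs" policy are all assumptions made for illustration. Only the 2048-token limit comes from the README.

```python
from typing import Iterable, Iterator, List

MAX_LEN = 2048  # sequence length stated in the README


def pack_sequences(
    token_seqs: Iterable[List[int]],
    max_len: int = MAX_LEN,
    sep_id: int = 0,  # hypothetical separator/EOS token id (assumption)
) -> Iterator[List[int]]:
    """Yield packed token lists, each at most max_len tokens long.

    Every input is truncated to max_len. Short inputs are then greedily
    concatenated, with a separator after each, until the next input
    would no longer fit.
    """
    buffer: List[int] = []
    for seq in token_seqs:
        seq = seq[:max_len]
        # Flush first if this sequence would not fit in the current buffer.
        if buffer and len(buffer) + len(seq) > max_len:
            yield buffer
            buffer = []
        buffer = buffer + seq
        # Add a separator only when there is still room for it.
        if len(buffer) < max_len:
            buffer.append(sep_id)
    if buffer:
        yield buffer


if __name__ == "__main__":
    toy_inputs = [[1] * 1000, [2] * 1000, [3] * 100, [4] * 3000]
    print([len(p) for p in pack_sequences(toy_inputs)])
```

Packing this way keeps most positions in each 2048-token sequence filled with real tokens rather than padding, which is the data-efficiency benefit the README refers to. A real pipeline might instead split sequences across pack boundaries or mask attention between packed examples; the README does not say which variant was used.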
+ # Infrastructure
+
+ We used [the Together Research Computer](https://together.xyz/) to conduct training.
+ Specifically, we used 4 data-parallel workers, each containing 2 \* A100 80GB GPUs.
+ The Together Research Computer connects clusters at Stanford University, ETH Zurich, Open Science Grid, and the University of Wisconsin-Madison.
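
As a rough sanity check on the numbers above, the following sketch derives a few quantities from the stated hyperparameters and infrastructure. The constants come from the README. The even split of the global batch across workers, and the assumption that every sequence is fully packed to 2048 tokens, are illustrative assumptions rather than details confirmed by the authors.

```python
NUM_WORKERS = 4          # data-parallel workers (from the README)
GPUS_PER_WORKER = 2      # A100 80GB GPUs per worker (from the README)
GLOBAL_BATCH_SIZE = 64   # sequences per optimizer step (from the README)
SEQ_LEN = 2048           # tokens per sequence (from the README)
TRAIN_STEPS = 5_000      # "5k steps" (from the README)

total_gpus = NUM_WORKERS * GPUS_PER_WORKER

# Assumption: the global batch is split evenly across the workers.
per_worker_batch = GLOBAL_BATCH_SIZE // NUM_WORKERS

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN

# Upper bound: assumes every sequence is packed to the full 2048 tokens.
max_total_tokens = tokens_per_step * TRAIN_STEPS

print(f"GPUs in total:          {total_gpus}")
print(f"Sequences per worker:   {per_worker_batch}")
print(f"Tokens per step:        {tokens_per_step}")
print(f"Tokens seen (at most):  {max_total_tokens}")
```

Under these assumptions the run uses 8 GPUs, each worker processes 16 sequences per step, and fine-tuning sees at most about 655M tokens.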