Update README.md
README.md CHANGED
@@ -19,43 +19,12 @@ widget:
 example_title: "Question Answering"
 ---

-<h1>TOGETHER RESEARCH</h1>
-
-***!!! Be careful, this repo is still under construction. The content may change frequently. !!!***
-
-# Model Summary
-
-We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
-The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
-
 # Quick Start

 ```python
 from transformers import pipeline

-pipe = pipeline(model='togethercomputer/
+pipe = pipeline(model='togethercomputer/GPT-JT-6B-v0')

 pipe("Where is Zurich? Ans:")
-```
-
-# Training Data
-
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile.
-- [Natural-Instructions](https://github.com/allenai/natural-instructions)
-- [P3](https://huggingface.co/datasets/Muennighoff/P3)
-- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
-- [the pile](https://huggingface.co/datasets/the_pile)
-
-The Pile is used to retain the general abilities of GPT-J.
-The others are instruction-tuning datasets.
-
-# Hyperparameters
-
-We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
-We used mixed-precision training, where activations are in FP16 while the optimizer states are kept in FP32.
-We truncate input sequences to 2048 tokens; for input sequences shorter than 2048 tokens, we concatenate multiple sequences into one long sequence to improve data efficiency.
-
-# Infrastructure
-
-We used [the Together Research Computer](https://together.xyz/) to conduct training.
-Specifically, we used 4 data parallel workers, each containing 2 \* A100 80GB GPUs.
+```
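For readers trying the new Quick Start, here is a slightly expanded version of the same snippet. The generation arguments and the output format below come from the standard `transformers` text-generation pipeline, not from this README, so treat it as a sketch rather than the card's official example:

```python
from transformers import pipeline

# Load the model from the Hub; the text-generation task is inferred from the model config.
pipe = pipeline(model='togethercomputer/GPT-JT-6B-v0')

# Greedy decoding with a small token budget keeps the answer short and deterministic.
result = pipe("Where is Zurich? Ans:", max_new_tokens=16, do_sample=False)

# Text-generation pipelines return a list with one dict per prompt.
print(result[0]["generated_text"])
```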
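The removed Hyperparameters section mentions concatenating inputs shorter than 2048 tokens into one long sequence. Below is a minimal sketch of that packing step, assuming a greedy, EOS-separated packer over raw text examples; the tokenizer choice and separator handling are assumptions, not details given in the card:

```python
from transformers import AutoTokenizer

# The card only names the base model, so reusing the GPT-J-6B tokenizer is an assumption.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

MAX_LEN = 2048  # context length stated in the removed Hyperparameters section


def pack_sequences(texts, max_len=MAX_LEN):
    """Greedily concatenate tokenized examples into chunks of at most max_len tokens."""
    packed, current = [], []
    for text in texts:
        ids = tokenizer(text)["input_ids"] + [tokenizer.eos_token_id]
        if len(ids) >= max_len:
            # Overlong examples are truncated, as the card describes.
            packed.append(ids[:max_len])
            continue
        if len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed


# Example: pack a few short examples; real training data would fill each 2048-token chunk.
chunks = pack_sequences(["Where is Zurich? Ans:", "Translate to French: hello"])
print([len(c) for c in chunks])
```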