---
language: 
  - en
datasets:
  - natural_instructions
  - the_pile
  - cot
  - Muennighoff/P3
tags:
  - gpt
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.0
widget:
  - text: "Where is Zurich? Ans:"
  - text: "What is the highest mountain? Answer:"
---

<div id="image" style="display:inline;">
    <img src="https://toma.together.xyz/logo.svg" width="110"/>
    <h1>TOGETHER</h1>
</div>

***!!! Note: this repo is still under construction and its content may change at any time. !!!***

# Model Summary

We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.

# Quick Start

```python
from transformers import pipeline

pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

pipe("Where is Zurich? Ans:")
```
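The model also supports few-shot prompting: prepend a few worked examples to the query. The snippet below is a minimal sketch; the demonstration format and decoding settings are illustrative choices, not a documented template for this model.

```python
from transformers import pipeline

pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

# A few worked examples followed by the actual query (illustrative format).
few_shot_prompt = (
    "Where is Zurich? Ans: Zurich is in Switzerland.\n"
    "Where is Kyoto? Ans: Kyoto is in Japan.\n"
    "Where is Austin? Ans:"
)

# Greedy decoding (do_sample=False) mirrors the temperature 0.0 setting
# used by the inference widget above.
print(pipe(few_shot_prompt, max_new_tokens=16, do_sample=False))
```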

# Training Data

We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on Natural Instructions, P3, MMLU-COT, and the Pile:
- [Natural-Instructions](https://github.com/allenai/natural-instructions)
- [P3](https://huggingface.co/datasets/Muennighoff/P3)
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- [the pile](https://huggingface.co/datasets/the_pile)

The Pile is included to preserve GPT-J's general language-modeling ability; the other three are instruction-tuning datasets. A rough sketch of how an instruction example might be flattened into a training sequence is shown below.
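The exact preprocessing of the instruction datasets is not documented here. As an illustration only (the field names `instruction`, `input`, `output` and the `Ans:` template are assumptions, not the authors' recipe), an instruction-tuning record could be flattened like this:

```python
def format_example(example: dict) -> str:
    """Flatten one instruction-tuning record into a single training string.
    Field names and the prompt template are illustrative assumptions."""
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    # Match the zero-shot style used in the widget prompts ("... Ans:").
    return f"{prompt} Ans: {example['output']}"

print(format_example({
    "instruction": "Answer the question.",
    "input": "What is the highest mountain?",
    "output": "Mount Everest.",
}))
```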

# Hyperparameters

We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
We used mixed-precision training, where activations are in FP16 while optimizer states are kept in FP32.
We truncate input sequences to 2048 tokens; for input sequences containing fewer than 2048 tokens, we concatenate multiple sequences into one long sequence to improve data efficiency.
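A minimal sketch of this packing step, assuming the GPT-J tokenizer; the use of EOS as a separator and the dropping of the final partial block are assumptions, not the exact training recipe.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
BLOCK_SIZE = 2048  # training sequence length

def pack_sequences(texts):
    """Concatenate tokenized texts (EOS-separated) and cut the stream
    into fixed-length blocks of BLOCK_SIZE tokens."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)  # assumed separator between documents
    # Drop the trailing remainder shorter than BLOCK_SIZE (assumption).
    return [ids[i:i + BLOCK_SIZE]
            for i in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE)]

blocks = pack_sequences(["Where is Zurich? Ans: Switzerland.",
                         "What is the highest mountain? Answer: Mount Everest."])
```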

# Infrastructure

We used [the Together Research Computer](https://together.xyz/) for training.
Specifically, we used four data-parallel workers, each with 2 \* A100 80GB GPUs.
Together Research Computer connects clusters at Stanford University, ETH Zurich, Open Science Grid, and University of Wisconsin-Madison.