|
--- |
|
language: |
|
- en |
|
datasets: |
|
- natural_instructions |
|
- the_pile |
|
- cot |
|
- Muennighoff/P3 |
|
tags: |
|
- gpt |
|
pipeline_tag: text-generation |
|
inference: |
|
parameters: |
|
temperature: 0.0 |
|
widget: |
|
- text: "Where is Zurich? Ans:" |
|
- text: "What is the highest mountain? Answer:" |
|
--- |
|
|
|
<div id="image" style="display:inline;"> |
|
<img src="https://toma.together.xyz/logo.svg" width="110"/> |
|
<h1>TOGETHER</h1>
|
</div> |
|
|
|
***!!! Be careful: this repo is still under construction, and its content may change at any time. !!!***
|
|
|
# Model Summary |
|
|
|
We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and performing zero-/few-shot inference.
|
The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
|
|
|
# Quick Start |
|
|
|
```python |
|
from transformers import pipeline

# Build a text-generation pipeline from the model repo (the task is inferred automatically).
pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

# Zero-shot inference: prompt the model with a question followed by "Ans:".
pipe("Where is Zurich? Ans:")
|
``` |
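
The model also supports few-shot prompting: prepend one or two solved examples before the new question. The "question + `Ans:`" template below follows the widget prompts above and is only an illustrative assumption, not a required format.

```python
from transformers import pipeline

# Same pipeline as in the Quick Start above.
pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

# Few-shot prompt: two solved examples followed by a new question.
# The "<question> Ans: <answer>" layout is an assumption, mirroring the widget prompts.
few_shot_prompt = (
    "Where is Zurich? Ans: Zurich is in Switzerland.\n"
    "What is the highest mountain? Ans: Mount Everest.\n"
    "Where is Paris? Ans:"
)
pipe(few_shot_prompt, max_new_tokens=16)
```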
|
|
|
# Training Data |
|
|
|
We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on Natural-Instructions (NI), P3, MMLU-CoT, and the Pile:
|
- [Natural-Instructions](https://github.com/allenai/natural-instructions) |
|
- [P3](https://huggingface.co/datasets/Muennighoff/P3) |
|
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json) |
|
- [the pile](https://huggingface.co/datasets/the_pile) |
|
|
|
The Pile is included to preserve the general language ability of GPT-J; the other three are instruction-tuning datasets.
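
The exact mixing recipe is not documented here. The snippet below is only an illustrative sketch of the idea: most training strings are instruction examples rendered into a prompt/answer format, with some raw Pile-style text mixed in to retain general language-modelling ability. The 80/20 weights and the `"<prompt> Ans: <answer>"` template are assumptions.

```python
import random

# Illustrative sketch only: hypothetical mixture of instruction-tuning examples
# and raw Pile-style text. The weights and prompt template are assumptions,
# not the recipe actually used for this model.
instruction_examples = [
    ("Where is Zurich?", "Zurich is in Switzerland."),
    ("What is the highest mountain?", "Mount Everest."),
]
pile_texts = [
    "The Pile is an 825 GiB open-source language-modelling corpus ...",
]

def sample_training_text(rng: random.Random) -> str:
    """Draw one training string: mostly instructions, with some raw text mixed in."""
    if rng.random() < 0.8:  # assumed instruction / raw-text ratio
        prompt, answer = rng.choice(instruction_examples)
        return f"{prompt} Ans: {answer}"
    return rng.choice(pile_texts)

rng = random.Random(0)
print([sample_training_text(rng) for _ in range(3)])
```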
|
|
|
# Hyperparameters |
|
|
|
We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
|
We used mixed-precision training, where activations are kept in FP16 while the optimizer states are kept in FP32.
|
We truncate input sequences to 2048 tokens; sequences shorter than 2048 tokens are concatenated into one long sequence to improve data efficiency, as sketched below.
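
A minimal sketch of this packing step, using the GPT-J tokenizer: tokenize each document, append an end-of-text separator, concatenate everything into one stream, and cut it into fixed 2048-token blocks. The separator choice and the handling of the trailing partial block are assumptions about preprocessing details not specified here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
BLOCK_SIZE = 2048

def pack_examples(texts):
    """Concatenate tokenized texts into one stream and slice it into 2048-token blocks."""
    stream = []
    for text in texts:
        stream.extend(tokenizer(text)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # assumed document separator
    # Drop the trailing partial block; a real pipeline might carry it over instead.
    n_blocks = len(stream) // BLOCK_SIZE
    return [stream[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(n_blocks)]

blocks = pack_examples(["Where is Zurich? Ans: Zurich is in Switzerland."] * 200)
print(len(blocks), len(blocks[0]))
```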
|
|
|
# Infrastructure |
|
|
|
We used [the Together Research Computer](https://together.xyz/) to conduct training. |
|
Specifically, we used 4 data parallel workers, each containing 2 \* A100 80GB GPUs. |
|
Together Research Computer connects clusters at Stanford University, ETH Zurich, Open Science Grid, and University of Wisconsin-Madison. |
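
As a rough illustration of where the communication savings mentioned in the Model Summary can come from, the sketch below shows generic communication-reduced data parallelism: each worker takes several local optimizer steps, and parameters are averaged across machines only every `sync_every` steps instead of all-reducing gradients at every step. This is not the actual ProxAdam algorithm (its details are not given in this card); it is only a hypothetical sketch assuming a PyTorch / Hugging Face training loop.

```python
import torch.distributed as dist

def local_step_with_periodic_sync(model, optimizer, batch, step, sync_every=50):
    """One local training step; parameters are synced across workers every `sync_every` steps."""
    loss = model(**batch).loss  # assumes a Hugging Face causal-LM style model and batch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Periodic parameter averaging replaces the per-step gradient all-reduce of
    # vanilla data parallelism, cutting cross-machine traffic roughly by `sync_every`x.
    if dist.is_initialized() and (step + 1) % sync_every == 0:
        world_size = dist.get_world_size()
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= world_size

    return loss.item()
```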