Update README.md
README.md CHANGED
@@ -16,42 +16,3 @@ Training a multilingual 176 billion parameters model in the open
The training of BigScience’s main model started on **March 11, 2022 11:42am PST** and will last 3-4 months on the 416 A100 GPUs of the Jean Zay public supercomputer

You can follow the training at [https://twitter.com/BigScienceLLM](https://twitter.com/BigScienceLLM)

## Summary of the model, dataset, hardware, training and environmental considerations
### **The model**

- 176B parameters decoder-only architecture (GPT-like)
- 70 layers - 112 attention heads per layer - hidden dimensionality of 14336 - 2048 tokens sequence length (a rough parameter-count check follows this list)
- ALiBi positional embeddings - GeLU activation function
- **More information**:
    - [Blog post summarizing how the architecture, size, shape, and pre-training duration were selected](https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours)
    - [More details on the architecture/optimizer](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
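As a quick sanity check on those shapes, the sketch below reproduces the 176B figure from the published dimensions. It assumes a standard GPT-style block (4h² of attention weights plus 8h² of MLP weights), tied input/output embeddings, no learned positional embeddings (ALiBi), and ignores biases and layer norms; it is a back-of-the-envelope estimate, not the project's model code.

```python
# Rough parameter count from the published shapes (sketch, not the official code).
# Assumes a standard GPT-style block (4*h^2 attention + 8*h^2 MLP weights),
# tied input/output embeddings, and no learned positional embeddings (ALiBi).
vocab_size = 250_680
hidden = 14_336
layers = 70

embedding_params = vocab_size * hidden   # ~3.6B
per_layer_params = 12 * hidden ** 2      # attention (4h^2) + MLP (8h^2)
total = embedding_params + layers * per_layer_params

print(f"{total / 1e9:.1f}B parameters")  # ~176.2B, matching the 176B figure
```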
### **The dataset**

- Multilingual: 46 languages (the full list is [here](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling))
- 341.6 billion tokens (1.5 TB of text data; see the quick ratios after this list)
- Tokenizer vocabulary: 250,680 tokens
- **More information**:
    - [Blog post detailing the design choices during the dataset creation](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
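Taking those figures at face value, a couple of derived ratios help put the dataset in perspective. This is a rough back-of-the-envelope sketch using only the numbers quoted above, not official project statistics.

```python
# Back-of-the-envelope ratios from the published dataset figures (sketch only).
tokens = 341.6e9          # training tokens
text_bytes = 1.5e12       # ~1.5 TB of raw text
params = 176e9            # model size

bytes_per_token = text_bytes / tokens   # ~4.4 bytes of raw text per token
tokens_per_param = tokens / params      # ~1.9 tokens seen per parameter

print(f"{bytes_per_token:.1f} bytes/token, {tokens_per_param:.1f} tokens/parameter")
```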
### **The engineering side**

- number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
- one copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
- checkpoint size: the bf16 weights alone are 329 GB, the full checkpoint with optimizer states is 2.3 TB
- training throughput: about 150 TFLOPs per GPU
- estimated training time: 3-4 months depending on throughput and unexpected events (a rough consistency check on these numbers follows this list)
- **More information**:
    - [Blog post on the hardware/engineering side](https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model)
    - [Details on the distributed setup used for the training](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
    - [Tensorboard updated during the training](https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss)
    - [Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, so many technical tricks and questions)](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md)
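The numbers above hang together under standard assumptions: 2 bytes per parameter for bf16 weights, roughly 14 bytes per parameter for a full mixed-precision checkpoint (bf16 weights plus fp32 master weights and the two fp32 Adam states), and the usual ~6 FLOPs per parameter per token for training. The sketch below is a rough consistency check, not the project's actual accounting.

```python
# Rough consistency check of the published engineering numbers (sketch only).
# Assumptions: 2 bytes/param for bf16 weights; ~14 bytes/param for a full
# mixed-precision checkpoint (bf16 weights + fp32 master weights + fp32 Adam
# momentum and variance); ~6 FLOPs per parameter per token for training.
params = 176e9
tokens = 341.6e9
gpus = 384
per_gpu_throughput = 150e12                 # ~150 TFLOPs per GPU, as reported

replicas = gpus // 48                       # 8 data-parallel copies of the model
bf16_weights_gib = params * 2 / 2**30       # ~328 GiB (reported as 329 GB)
full_checkpoint_tib = params * 14 / 2**40   # ~2.2 TiB (reported as 2.3 TB)

train_flops = 6 * params * tokens
train_days = train_flops / (gpus * per_gpu_throughput) / 86_400

print(f"{replicas} replicas, {bf16_weights_gib:.0f} GiB weights, "
      f"{full_checkpoint_tib:.1f} TiB checkpoint, ~{train_days:.0f} days of pure compute")
# ~72 days of ideal, uninterrupted compute; restarts, evaluation, and downtime
# stretch this toward the quoted 3-4 months.
```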
### **Environmental considerations**

- [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html), the supercomputer we are using for model training, is mostly powered by nuclear energy, which is a low carbon energy source.
- Significant efforts were made to make sure that the computing infrastructure is as efficient as possible: the heat generated by the hardware even gets used for heating buildings on campus!
- **More information**:
    - We are currently working on making a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference. More soon!