---
license: openrail
datasets:
- allenai/c4
language:
- en
metrics:
- glue
pipeline_tag: fill-mask
tags:
- cramming
- bert
- NLU
---



# crammed BERT (legacy/v1)

This is one of the final models described in the **first version** of "Cramming: Training a Language Model on a Single GPU in One Day". It is an *English*-language model pretrained like BERT, but with less compute; this checkpoint was trained for 24 hours on a single A6000 GPU. To use this model, you need the code from the repository at https://github.com/JonasGeiping/cramming, tagged `v1.13`.

You can find the paper (the link points to the old v1 version on arXiv) at https://arxiv.org/abs/2212.14034v1; the abstract is reproduced below:

> Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question:
> How far can we get with a single GPU in just one day?
> We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.


## Intended uses & limitations

This is the raw pretraining checkpoint. You can fine-tune it on a downstream task such as GLUE, as discussed in the paper. The model is provided only as a sanity check for research purposes; it is untested and unfit for deployment.

### How to use


```python
import cramming  # needed so that transformers can resolve the crammed-BERT architecture
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonasGeiping/crammed-bert")
model = AutoModelForMaskedLM.from_pretrained("JonasGeiping/crammed-bert-legacy")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")

# The c5 variant can only run on CUDA with AMP autocasting.
model.cuda()
cuda_input = {k: v.cuda() for k, v in encoded_input.items()}

with torch.autocast("cuda"):
    output = model(**cuda_input)
```
If you want to run the `c5` model (which includes flash-attention) on `cpu`, load the config with `config.arch["attention"]["type"] = "pytorch"` instead and convert all missing weights.
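
A minimal sketch of that CPU path is below. It assumes the loaded config exposes the `arch` dictionary exactly as referenced above; the repo ids follow this card, and the separate weight-conversion step mentioned above is not shown.

```python
import cramming  # registers the crammed-BERT architecture with transformers
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

# Load the config and swap the attention implementation for the plain PyTorch
# one, then reload the model with that config (weight conversion not shown).
config = AutoConfig.from_pretrained("JonasGeiping/crammed-bert-legacy")
config.arch["attention"]["type"] = "pytorch"  # replace flash-attention

model = AutoModelForMaskedLM.from_pretrained("JonasGeiping/crammed-bert-legacy", config=config)
tokenizer = AutoTokenizer.from_pretrained("JonasGeiping/crammed-bert")

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
output = model(**encoded_input)  # runs on CPU, no autocast required
```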


### Limitations and bias

The training data used for this model was further filtered and sorted beyond standard C4 (not that standard C4 is of particularly high quality to begin with). These modifications were not tested for unintended consequences.

## Training data, Training procedure, Preprocessing, Pretraining

These are discussed in the paper. You can find the final configurations for each in this repository (a short loading sketch follows the list):
* `data_budget_hours_24.json` is the data configuration
* `train_budget_hours_24.json` is the training configuration
* `arch_budget_hours_24.json` is the architecture configuration
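
The following is a small, hypothetical sketch for fetching and peeking into those three JSON files with `huggingface_hub`; the filenames are taken from the list above, and the repo id is assumed to be this model's repository.

```python
# Hypothetical helper: download the three released config files and list
# their top-level keys. Repo id assumed; adjust if the files live elsewhere.
import json
from huggingface_hub import hf_hub_download

CONFIG_FILES = [
    "data_budget_hours_24.json",
    "train_budget_hours_24.json",
    "arch_budget_hours_24.json",
]

for filename in CONFIG_FILES:
    path = hf_hub_download(repo_id="JonasGeiping/crammed-bert-legacy", filename=filename)
    with open(path) as f:
        cfg = json.load(f)
    print(f"{filename}: {sorted(cfg)[:8]}")  # show a few top-level keys
```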

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE dev-set results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      | 83.9/84.1  | 87.3  | 89.5 | 92.2  | 44.5 | 84.6  | 87.5  | 53.8| 78.6   |

These numbers are the median over 5 trials on "GLUE-sane", evaluated on the GLUE dev set. In this variant of GLUE, fine-tuning is limited to 5 epochs per task, and the hyperparameters have to be chosen identically for all tasks.
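
As a quick sanity check of the table (our own reading, not an official evaluation script), the Average column matches a plain mean over the nine reported per-task entries, with MNLI-m and MNLI-mm counted separately:

```python
# Reproduce the "Average" column from the table above: a plain mean over the
# nine reported dev-set scores (MNLI-m and MNLI-mm counted separately).
from statistics import mean

dev_scores = [
    83.9, 84.1,  # MNLI-m / MNLI-mm
    87.3,        # QQP
    89.5,        # QNLI
    92.2,        # SST-2
    44.5,        # CoLA
    84.6,        # STS-B
    87.5,        # MRPC
    53.8,        # RTE
]
print(round(mean(dev_scores), 1))  # -> 78.6
```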

### BibTeX entry and citation info

```bibtex
@article{geiping_cramming_2022,
  title = {Cramming: {{Training}} a {{Language Model}} on a {{Single GPU}} in {{One Day}}},
  shorttitle = {Cramming},
  author = {Geiping, Jonas and Goldstein, Tom},
  year = {2022},
  month = dec,
  eprint = {2212.14034},
  eprinttype = {arxiv},
  primaryclass = {cs},
  publisher = {{arXiv}},
  doi = {10.48550/arXiv.2212.14034},
  url = {http://arxiv.org/abs/2212.14034},
  urldate = {2023-01-10},
  abstract = {Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.},
  archiveprefix = {arXiv},
  keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning},
  journal = {arxiv:2212.14034[cs]}
}
```