Update README.md
README.md
CHANGED
@@ -10,7 +10,6 @@ As an autoregressive language model, PubMed GPT 2.7B is also capable of natural
 
 This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.edu/) and [MosaicML](https://www.mosaicml.com/).
 
-
 # Table of Contents
 
 - [Model Card for Pubmed GPT 2.7B](#model-card-for--model_id-)
@@ -32,8 +31,6 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed
 - [Model Architecture and Objective](#model-architecture-and-objective)
 - [Compute Infrastructure](#compute-infrastructure)
 
-
-
 # Model Details
 
 ## Model Description
@@ -61,21 +58,18 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed
 
 It is possible to use this model to generate text, which is useful for experimentation and understanding its capabilities. It should not be directly used for production or work that may directly impact people.
 
-
 ## Downstream Use
 
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
 The main way we have used this model is finetuning for downstream question answering tasks, and we recommend using this model that way.
 
-
 ## Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
 We do not recommend using this model for natural language generation in a production environment, finetuned or otherwise.
 
-
 # Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
@@ -83,13 +77,11 @@ We do not recommend using this model for natural language generation in a produc
 
 Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
 
-
 ## Recommendations
 
 <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 While this model is capable of generating natural language text, we have only begun to explore this capability and its limitations. Understanding these limitations is especially important in a domain like medicine. Therefore, **we strongly recommend against using this model in production for natural language generation.**
 
-
 # Training Details
 
 ## Training Data
@@ -98,7 +90,6 @@ While this model is capable of generating natural language text, we have only be
 
 This model was trained on the Pubmed Abstracts and Full Text from [The Pile](https://pile.eleuther.ai/).
 
-
 ## Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
@@ -115,17 +106,14 @@ The model was trained on [MosaicML Cloud](https://www.mosaicml.com/cloud), a pla
 
 The training process was very smooth and did not suffer from any divergences.
 
-
 As we were preparing the training run, we were unsure of the benefits of training out to 300B tokens for language model perplexity and downstream task performance. While most models of this scale (e.g. GPT Neo 2.7B) are trained to 300-400B tokens, the datasets those models use are vastly larger than PubMed. For instance, The Pile is 8x the size of its PubMed subcorpora.
 
-
 Fortunately, we did continue to see steady perplexity improvements on the validation and training sets for the entirety of training, and preliminary experiments showed improved downstream task performance as we trained out to the full 300B tokens. Our takeaway from this was that it was indeed worth it to train for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models.
 
 ### Preprocessing
 
 The model uses a custom tokenizer trained on the PubMed Abstracts. When building domain specific models we have found it important to use a tokenizer trained on in-domain text to maximize performance on downstream tasks. A key benefit is that common biomedical terms are represented as entire tokens.
 
-
 For instance, all of these following terms are tokenized into single tokens by the biomedical tokenizer and multiple tokens by the standard GPT-2 tokenizer:
 
 
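A note on the training paragraphs in the hunk above: the "dramatically more passes through the data" remark follows from simple arithmetic. The sketch below is illustrative only. It assumes corpus size is measured in tokens, and the function name `relative_passes` and its parameters are hypothetical, not something from the README.

```python
def relative_passes(size_ratio: float, budget_ratio: float = 1.0) -> float:
    """Return how many times more passes are made over the smaller corpus.

    size_ratio:   larger corpus size / smaller corpus size
                  (8 for The Pile vs. its PubMed subcorpora, per the README).
    budget_ratio: training tokens used on the smaller corpus divided by
                  training tokens used on the larger corpus (assumed 1.0).
    """
    if size_ratio <= 0 or budget_ratio <= 0:
        raise ValueError("ratios must be positive")
    # passes_small / passes_large = (T_small / S) / (T_large / (r * S))
    #                             = r * (T_small / T_large)
    return size_ratio * budget_ratio


# Same token budget on both corpora.
print(relative_passes(8.0))
# A 300B-token run compared against a 400B-token run on the larger corpus.
print(relative_passes(8.0, 300 / 400))
```

Under these assumptions, a PubMed-only run makes several times as many passes over its data as a same-budget run over The Pile, which is why the authors were initially unsure the full 300B tokens would pay off.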
@@ -137,7 +125,6 @@ For instance, all of these following terms are tokenized into single tokens by t
 | photosynthesis | photos/ynthesis |
 | probiotic | prob/iotic |
 
-
 This allows the model to encode information about these concepts in their individual token representations rather than spread out across subword tokens like “oh” shared with many other terms.
 
 # Environmental Impact
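The tokenizer comparison in the last hunk can be illustrated with a toy example. This is a minimal sketch, not the model's actual tokenizer: it uses greedy longest-match splitting rather than BPE, and both vocabularies are hypothetical stand-ins built from the two table rows visible in the diff.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` by repeatedly taking the longest vocab entry that
    matches at the current position (toy scheme, not real BPE)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocab entry matches here: emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical vocabularies: the general one only knows the subword pieces,
# the biomedical one also contains the whole terms.
general_vocab = {"photos", "ynthesis", "prob", "iotic"}
biomedical_vocab = general_vocab | {"photosynthesis", "probiotic"}

for term in ("photosynthesis", "probiotic"):
    print(term, greedy_tokenize(term, biomedical_vocab), greedy_tokenize(term, general_vocab))
```

With the whole term in the vocabulary, each word maps to one token, so its representation does not have to be shared with unrelated words that happen to contain the same pieces.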