sam-mosaic committed
Commit c45ca0a
1 Parent(s): abe8dd5

Update README.md

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -21,6 +21,7 @@ inference: false
 
 MPT-7B-Instruct-8k is a model for long-form instruction following, especially question-answering on and summarization of longer documents.
 It is built by finetuning [MPT-7B-8k](https://huggingface.co/mosaicml/mpt-7b-8k) on [Dolly HHRLHF](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets. It is also trained on [Competition Math](https://huggingface.co/datasets/competition_math), [Duorc](https://huggingface.co/datasets/duorc), [CoT GSM8k](https://huggingface.co/datasets/conceptofmind/cot_submix_original), [Qasper](https://huggingface.co/datasets/allenai/qasper), [Quality](https://huggingface.co/datasets/emozilla/quality), [Summ Screen FD](https://huggingface.co/datasets/tau/scrolls) and [Spider](https://huggingface.co/datasets/spider).
+This is the same dataset that [MPT-30B-Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct) was trained on.
 * License: _CC-By-SA-3.0_
 * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct-8k)
 
@@ -143,18 +144,17 @@ The model has been modified from a standard transformer in the following ways:
 
 The model was trained on the following data mix:
 
-| Data Source | Number of Tokens in Source | Proportion |
+| Data Source | Number of Tokens in Source | Proportion |
 |-------------|----------------------------|------------|
-| Airoboros/GPT4-1.2 | 26.4M | 1.71% |
-| Baize | 55.0M | 3.57% |
-| Camel | 301M | 19.54% |
-| GPTeacher | 7.56M | 0.49% |
-| Guanaco | 15.6M | 1.02% |
-| LongCoversations | 18.4M | 1.19% |
-| ShareGPT | 821M | 53.24% |
-| WizardLM | 297M | 19.23% |
-
-"LongConversations" is a GPT3.5/4-generated dataset, details of which will be released at a later date.
+| competition_math | 1.6 M | 3.66% |
+| cot_gsm8k | 3.36 M | 7.67% |
+| dialogsum | 0.1 M | 0.23% |
+| dolly_hhrlhf | 5.89 M | 13.43% |
+| duorc | 7.8 M | 17.80% |
+| qasper | 8.72 M | 19.90% |
+| quality | 11.29 M | 25.78% |
+| scrolls/summ_screen_fd | 4.97 M | 11.33% |
+| spider | 0.089 M | 0.20% |
 
 ### Training Configuration
 
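A note on the updated data-mix table: the Proportion column follows directly from each source's token count divided by the total across the nine sources (roughly 43.8M tokens). The short Python sketch below uses only the numbers from the table and reproduces the published percentages to within rounding; the variable names are illustrative.

```python
# Token counts (in millions) copied from the updated data-mix table above.
token_counts_m = {
    "competition_math": 1.6,
    "cot_gsm8k": 3.36,
    "dialogsum": 0.1,
    "dolly_hhrlhf": 5.89,
    "duorc": 7.8,
    "qasper": 8.72,
    "quality": 11.29,
    "scrolls/summ_screen_fd": 4.97,
    "spider": 0.089,
}

total_m = sum(token_counts_m.values())  # roughly 43.8M tokens in total

for source, tokens_m in token_counts_m.items():
    # Each source's share of the finetuning mix, as a percentage of the total.
    print(f"{source:<24} {tokens_m:>6.2f}M  {100 * tokens_m / total_m:5.2f}%")
```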
 
 
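For readers landing on this commit, a brief usage sketch of the model described above: it loads the checkpoint with Hugging Face Transformers (MPT requires `trust_remote_code=True` for its custom architecture) and wraps an instruction in a Dolly/Alpaca-style prompt. The prompt template, generation settings, and example instruction are illustrative assumptions, not part of this commit; the full model card documents the recommended formatting.

```python
import transformers

name = "mosaicml/mpt-7b-instruct-8k"

# MPT ships a custom model class in its repo, so trust_remote_code is required.
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,
    torch_dtype="auto",
)

# Hypothetical Dolly/Alpaca-style instruction prompt; check the model card for
# the exact recommended format.
template = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n### Response:\n"
)
prompt = template.format(
    instruction="Summarize the key findings of the report below.\n<long document here>"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

With the 8k-token context window, a longer document can be placed inside the instruction for question answering or summarization, which is the use case the description above highlights.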