---
license: cc-by-sa-3.0
datasets:
- mosaicml/dolly_hhrlhf
tags:
- Composer
- MosaicML
- llm-foundry
---

# MPT-7B-Instruct

MPT-7B-Instruct is a model for short-form instruction following.
It is built by finetuning [MPT-7B (Base)](https://huggingface.co/mosaicml/mpt-7b) on a [dataset](https://huggingface.co/datasets/sam-mosaic/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
* License: _CC-By-SA-3.0_ (commercial use permitted)
* [Online Demo](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

## Model Date

May 5, 2023

## Model License

CC-By-SA-3.0 (commercial use permitted)

## Documentation

* [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
* [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
* Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)!

### Example Question/Instruction

**Longboi24**

> What is a quoll?

**MPT-7B-Instruct**

> A Quoll (pronounced “cool”) is one of Australia’s native carnivorous marsupial mammals, which are also known as macropods or wallabies in other parts around Asia and South America

## How to Use

Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we use a custom model architecture that is not yet part of the `transformers` package.

The custom architecture includes options for many training efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

```python
import torch
import transformers

# trust_remote_code=True is required to load the custom MPT architecture
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct', trust_remote_code=True, torch_dtype=torch.bfloat16)
```

To use the optimized triton implementation of FlashAttention, load the model with `attn_impl='triton'` and move it to a GPU in `bfloat16`:

```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct', trust_remote_code=True, torch_dtype=torch.bfloat16, attn_impl='triton')
model.to(device='cuda:0', dtype=torch.bfloat16)
```

Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:

```python
import transformers

# Override the trained sequence length (2048) before loading the weights
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b-instruct', trust_remote_code=True)
config.update({"max_seq_len": 4096})
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct', config=config, trust_remote_code=True)
```
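
To illustrate why ALiBi permits a longer `max_seq_len`, the sketch below computes the attention biases as formulated in the ALiBi paper for a power-of-two head count. It is not MPT's own implementation, which may differ in detail, and the helper names `alibi_slopes` and `alibi_bias` are hypothetical. The bias depends only on the distance between query and key positions, so it is defined for any sequence length.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) (power-of-two head counts)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope, seq_len):
    """Causal ALiBi bias for one head.

    bias[i][j] = -slope * (i - j) for keys j <= query i;
    future positions (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
short = alibi_bias(slopes[0], 4)
long = alibi_bias(slopes[0], 8)

# The bias for a longer sequence extends the shorter one without changing it.
same_prefix = all(long[i][:4] == short[i] for i in range(4))
print(same_prefix)
```

Because no learned positional embedding table is involved, raising the sequence length only enlarges this bias matrix. Output quality at lengths well beyond the training length is a separate question that this sketch does not address.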

This model was trained with the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```
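
Putting the tokenizer and model together, a minimal generation sketch might look like the following. The prompt template is an assumption: it follows the Dolly-style instruction format and is not specified in this card. `format_prompt` and `generate_response` are hypothetical helper names. The heavy imports sit inside `generate_response` so that the prompt helper can be used without loading the model.

```python
# Assumed Dolly-style template; not stated in this model card.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_prompt(instruction):
    """Wrap a raw instruction in the assumed instruction/response template."""
    return f"{INTRO}\n### Instruction:\n{instruction}\n### Response:\n"


def generate_response(instruction, max_new_tokens=64):
    """Load the model and tokenizer, then generate a reply (downloads the weights)."""
    import torch
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = transformers.AutoModelForCausalLM.from_pretrained(
        "mosaicml/mpt-7b-instruct",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    inputs = tokenizer(format_prompt(instruction), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = format_prompt("What is a quoll?")
```

Calling `generate_response("What is a quoll?")` would then return the model's answer as a string. The output varies with the generation settings.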

## Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:
* It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
* It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings
* It does not use biases

| Hyperparameter | Value |
|----------------|-------|
| n_parameters | 6.7B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
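
As a sanity check on the table, the parameter count can be estimated from the other hyperparameters. This is a rough sketch under assumptions not stated in this card: the MLP hidden size is 4 × `d_model`, the output projection shares weights with the token embedding, and each LayerNorm has a weight vector but no bias. The function name `estimate_parameters` is hypothetical.

```python
# Values taken from the hyperparameter table above.
N_LAYERS = 32
D_MODEL = 4096
VOCAB_SIZE = 50432

# Assumption (not stated in this card): MLP hidden size is 4 * d_model.
MLP_EXPANSION = 4


def estimate_parameters(n_layers=N_LAYERS, d_model=D_MODEL,
                        vocab_size=VOCAB_SIZE, mlp_expansion=MLP_EXPANSION):
    """Estimate the parameter count of a bias-free decoder-only transformer."""
    embedding = vocab_size * d_model             # token embedding, assumed tied with the output layer
    attention = 4 * d_model * d_model            # fused QKV (3 d^2) plus output projection (d^2)
    mlp = 2 * mlp_expansion * d_model * d_model  # up- and down-projection
    norms = 2 * d_model                          # two LayerNorm weight vectors per block
    final_norm = d_model
    return embedding + n_layers * (attention + mlp + norms) + final_norm


total = estimate_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 6.65B parameters, in line with the 6.7B listed in the table.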

## Pretraining Data

For more details on the pretraining process, see [MPT-7B](https://huggingface.co/mosaicml/mpt-7b).

The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

## Training Configuration

This model was finetuned on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform).

## Acknowledgements

This model was finetuned by Sam Havens and the MosaicML NLP team.