---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
---

# Nemotron-4-Minitron-4B-Base

Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) model. We prune the model's embedding size, attention heads, and MLP intermediate dimension, and then perform continued training with distillation to arrive at the final models.
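
Purely as an illustration of what width pruning means here (and not the Minitron recipe; see the paper for the actual importance estimation and distillation setup), the toy PyTorch sketch below trims the intermediate dimension of a single MLP block by keeping the up-projection rows with the largest weight norm. `ToyMLP` and `prune_mlp_width` are invented for this example.

```python
import torch
import torch.nn as nn

class ToyMLP(nn.Module):
    """A stand-in for one transformer MLP block (up-projection, activation, down-projection)."""
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

def prune_mlp_width(mlp: ToyMLP, keep: int) -> ToyMLP:
    # Rank intermediate neurons by the L2 norm of their up-projection weights
    # (a simple magnitude proxy used only for this illustration).
    scores = mlp.up.weight.norm(dim=1)
    idx = torch.topk(scores, keep).indices.sort().values

    pruned = ToyMLP(mlp.up.in_features, keep)
    with torch.no_grad():
        pruned.up.weight.copy_(mlp.up.weight[idx])
        pruned.up.bias.copy_(mlp.up.bias[idx])
        pruned.down.weight.copy_(mlp.down.weight[:, idx])
        pruned.down.bias.copy_(mlp.down.bias)
    return pruned

mlp = ToyMLP()
small = prune_mlp_width(mlp, keep=128)  # 256 -> 128 intermediate neurons
print(small.up.weight.shape)            # torch.Size([128, 64])
# After pruning, the smaller model is retrained with distillation
# against the original to recover accuracy.
```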
Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper](https://arxiv.org/abs/2407.14679) for more details.

Minitron models are for research and development only.
## HuggingFace Quickstart

Support for Nemotron models will be added in the upcoming transformers library release. In the meantime, please install the library from source:

```
pip install git+https://github.com/huggingface/transformers
```

The following code provides an example of how to load the Minitron-4B model and use it to perform text generation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_path = 'nvidia/Nemotron-4-Minitron-4B-Base'
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = 'cuda'
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map=device)

# Prepare the input text
prompt = 'Complete the paragraph: our solar system is'
inputs = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

# Generate the output
outputs = model.generate(inputs, max_length=20)

# Decode and print the output
output_text = tokenizer.decode(outputs[0])
print(output_text)
```
## License

Minitron is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

## Evaluation Results

*5-shot performance.* Language Understanding evaluated using [Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300):

| Average |
| :---- |
| 58.6 |

*Zero-shot performance.* Evaluated using select datasets from the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) with additions (a reproduction sketch follows the table):

| HellaSwag | Winogrande | GSM8K | ARC-C | XLSum |
| :------------- | :------------- | :------------- | :------------- | :------------- |
| 75.0 | 74.0 | 24.1 | 50.9 | 29.5 |
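
For reference, the snippet below is a hedged sketch of how one might run a subset of these zero-shot tasks with a recent lm-evaluation-harness release (v0.4+, which exposes `lm_eval.simple_evaluate`). The harness version, task names, and batch size are assumptions, and the additions mentioned above are not reproduced, so the resulting scores may differ from the table.

```python
import lm_eval

# Hedged sketch: assumes lm-evaluation-harness v0.4+ and its built-in task names;
# the custom additions used for the table above are not included here.
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=nvidia/Nemotron-4-Minitron-4B-Base,dtype=bfloat16',
    tasks=['hellaswag', 'winogrande', 'arc_challenge', 'gsm8k'],
    num_fewshot=0,
    batch_size=8,
)
print(results['results'])  # per-task metric dictionary
```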
*Code generation performance.* Evaluated using [HumanEval](https://github.com/openai/human-eval) (a generation sketch follows the table):

| pass@1, 0-shot |
| :------------- |
| 23.3 |
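
Purely as an illustrative sketch (not the evaluation setup used for the table), the snippet below shows one way to generate HumanEval completions with this model using OpenAI's `human-eval` package and then score them with its `evaluate_functional_correctness` command. The greedy decoding, token budget, and lack of stop-sequence handling are assumptions that will affect the measured pass@1.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from human_eval.data import read_problems, write_jsonl

model_path = 'nvidia/Nemotron-4-Minitron-4B-Base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda')

def complete(prompt: str) -> str:
    # Greedy decoding is an assumption; pass@1 setups often use greedy or low-temperature sampling.
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    # Return only the generated continuation, not the prompt.
    return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

problems = read_problems()
samples = [{'task_id': task_id, 'completion': complete(problems[task_id]['prompt'])}
           for task_id in problems]
write_jsonl('samples.jsonl', samples)
# Score with: evaluate_functional_correctness samples.jsonl
```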
Please refer to our [paper](https://arxiv.org/abs/2407.14679) for the full set of results.

## Citation

If you find our work helpful, please consider citing our paper:
```
@article{minitron2024,
      title={Compact Language Models via Pruning and Knowledge Distillation},
      author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
      journal={arXiv preprint arXiv:2407.14679},
      year={2024},
      url={https://arxiv.org/abs/2407.14679},
}
```