---
license: apache-2.0
tags:
- nvidia
---
|
|
## Mistral-NeMo-12B-Base
|
|
|
[![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)[![Model size](https://img.shields.io/badge/Params-12B-green)](#model-architecture)[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)
|
|
|
### Model Overview:
|
|
|
Mistral-NeMo-12B-Base is a Large Language Model (LLM) with 12B parameters, trained jointly by NVIDIA and Mistral AI. It significantly outperforms existing models of smaller or similar size.
|
|
|
**Key features**

- Released under the Apache 2 License
- Pre-trained and instructed versions
- Trained with a 128k context window
- Trained on a large proportion of multilingual and code data
|
|
|
### Intended use
|
|
|
Mistral-NeMo-12B-Base is a completion model intended for use across more than 80 programming languages and designed for global, multilingual applications. It is fast, trained on function calling, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. It is compatible with the [NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html). For best performance on a given task, users are encouraged to customize the model using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more) and Model Alignment (SFT, SteerLM, RLHF, and more) with [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner). Refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/nemotron/index.html) for examples.
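
For quick orientation, the sketch below shows plain text completion with Hugging Face Transformers. It is a minimal example under stated assumptions: the repository id `nvidia/Mistral-NeMo-12B-Base` and the availability of a Transformers-format checkpoint are assumptions, and a recent `transformers` release with Mistral support is required. For NeMo-native workflows, follow the NeMo Framework documentation linked above.

```python
# Minimal completion sketch with Hugging Face Transformers.
# Assumptions: the checkpoint is published in Transformers format under the
# repo id below (adjust to the actual repository), and a recent transformers
# release with Mistral support is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-12B-Base"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 12B parameters in bf16 is roughly 24 GB of weights
    device_map="auto",
)

# Base (completion) model: no chat template, simply continue the prompt.
prompt = "The three main advantages of grouped-query attention are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```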
|
|
|
**Model Developers:** [NVIDIA](https://www.nvidia.com/en-us/) and [Mistral AI](https://mistral.ai/)
|
|
|
**Model Dates:** Mistral-NeMo-12B-Base was trained between 2023 and July 2024.
|
|
|
### Model Architecture:
|
|
|
Mistral-NeMo-12B-Base is a transformer model with the following architecture choices:
|
|
|
- Layers: 40
- Dim: 5,120
- Head dim: 128
- Hidden dim: 14,336
- Activation Function: SwiGLU
- Number of heads: 32
- Number of kv-heads: 8 (GQA; see the KV-cache sketch below)
- Rotary embeddings (theta = 1M)
- Vocabulary size: 2^17 ≈ 128k (131,072)
|
|
|
**Architecture Type:** Transformer Decoder (auto-regressive language model)
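
The grouped-query attention figures above also determine how much KV-cache memory the 128k context window needs. The arithmetic below is an illustrative sketch, not part of the model card, and assumes 16-bit (2-byte) cache entries:

```python
# Back-of-the-envelope KV-cache sizing from the architecture figures above.
# Illustrative only; assumes 16-bit (2-byte) keys and values.
layers = 40
head_dim = 128
n_heads = 32            # query heads
n_kv_heads = 8          # GQA: keys/values shared across groups of 4 query heads
bytes_per_value = 2     # fp16/bf16 assumption

def kv_cache_bytes_per_token(kv_heads: int) -> int:
    # 2x for keys and values, accumulated over every layer
    return 2 * kv_heads * head_dim * bytes_per_value * layers

gqa = kv_cache_bytes_per_token(n_kv_heads)
mha = kv_cache_bytes_per_token(n_heads)
context = 128 * 1024    # 128k-token context window

print(f"KV cache per token (GQA, 8 kv-heads): {gqa / 1024:.0f} KiB")     # 160 KiB
print(f"KV cache per token (full MHA, 32 heads): {mha / 1024:.0f} KiB")  # 640 KiB
print(f"Full 128k context (GQA): {gqa * context / 2**30:.1f} GiB")       # 20.0 GiB
```

Under these assumptions, GQA with 8 kv-heads cuts the cache to a quarter of what full multi-head attention would need, roughly 20 GiB instead of 80 GiB at the full 128k context.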
|
|
|
### Evaluation Results
|
|
|
**Main Benchmarks**

- HellaSwag (0-shot): 83.5%
- Winogrande (0-shot): 76.8%
- OpenBookQA (0-shot): 60.6%
- CommonSenseQA (0-shot): 70.4%
- TruthfulQA (0-shot): 50.3%
- MMLU (5-shot): 68.0%
- TriviaQA (5-shot): 73.8%
- NaturalQuestions (5-shot): 31.2%
|
|
|
**Multilingual Benchmarks**

Multilingual MMLU (5-shot):
|
- French: 62.3%
- German: 62.7%
- Spanish: 64.6%
- Italian: 61.3%
- Portuguese: 63.3%
- Russian: 59.2%
- Chinese: 59.0%
- Japanese: 59.0%
|