---
license: apache-2.0
tags:
- nvidia
---

## Mistral-NeMo-12B-Base

[![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)
[![Model size](https://img.shields.io/badge/Params-12B-green)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)

### Model Overview:

Mistral-NeMo-12B-Base is a Large Language Model (LLM) composed of 12B parameters, trained jointly by NVIDIA and Mistral AI. It significantly outperforms existing models of smaller or similar size.

**Key features**

- Released under the Apache 2.0 License
- Pre-trained and instruction-tuned versions
- Trained with a 128k context window
- Trained on a large proportion of multilingual and code data

### Intended use

Mistral-NeMo-12B-Base is a completion model intended for use with over 80 programming languages and designed for global, multilingual applications. It is fast, trained on function-calling data, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. A minimal completion sketch is provided at the end of this card.

It is compatible with [NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html). For best performance on a given task, users are encouraged to customize the model using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more) and Model Alignment (SFT, SteerLM, RLHF, and more) with [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner). Refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/nemotron/index.html) for examples.

**Model Developer:** [NVIDIA](https://www.nvidia.com/en-us/) and [Mistral AI](https://mistral.ai/)

**Model Dates:** Mistral-NeMo-12B-Base was trained between 2023 and July 2024.

### Model Architecture:

Mistral-NeMo-12B-Base is a transformer model with the following architecture choices (see the configuration sketch at the end of this card):

- Layers: 40
- Dim: 5,120
- Head dim: 128
- Hidden dim: 14,436
- Activation function: SwiGLU
- Number of heads: 32
- Number of kv-heads: 8 (GQA)
- Rotary embeddings (theta = 1M)
- Vocabulary size: 2^17 ≈ 128k

**Architecture Type:** Transformer Decoder (auto-regressive language model)

### Evaluation Results

**Main benchmarks**

- HellaSwag (0-shot): 83.5%
- Winogrande (0-shot): 76.8%
- OpenBookQA (0-shot): 60.6%
- CommonSenseQA (0-shot): 70.4%
- TruthfulQA (0-shot): 50.3%
- MMLU (5-shot): 68.0%
- TriviaQA (5-shot): 73.8%
- NaturalQuestions (5-shot): 31.2%

**Multilingual benchmarks**

Multilingual MMLU in a 5-shot setting:

- French: 62.3%
- German: 62.7%
- Spanish: 64.6%
- Italian: 61.3%
- Portuguese: 63.3%
- Russian: 59.2%
- Chinese: 59.0%
- Japanese: 59.0%
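
### Architecture configuration (sketch)

The hyperparameters listed under Model Architecture map onto a standard decoder-only configuration. The snippet below is a minimal sketch that restates those figures using the Hugging Face `transformers` `MistralConfig` class; it assumes a recent `transformers` release that accepts `head_dim`, and it is not the checkpoint's shipped configuration file.

```python
from transformers import MistralConfig

# Illustrative only: the values below are copied from the architecture list
# in this card, not read from the distributed checkpoint.
config = MistralConfig(
    num_hidden_layers=40,          # Layers
    hidden_size=5120,              # Dim
    head_dim=128,                  # Head dim (explicit, since 5120 / 32 != 128)
    intermediate_size=14436,       # Hidden (MLP) dim, as listed above
    hidden_act="silu",             # gate activation used by the SwiGLU MLP
    num_attention_heads=32,        # Number of heads
    num_key_value_heads=8,         # kv-heads (GQA)
    rope_theta=1_000_000.0,        # Rotary embeddings, theta = 1M
    vocab_size=2**17,              # ~128k vocabulary
    max_position_embeddings=128 * 1024,  # 128k context window
)
print(config)
```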
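
### Quickstart (sketch)

This card does not prescribe a loading recipe. The following is a minimal completion sketch using the Hugging Face `transformers` library; the repository id `mistralai/Mistral-Nemo-Base-2407` is assumed to be a transformers-compatible distribution of this model. For NeMo-format checkpoints and customization (PEFT, alignment), use the NeMo Framework tooling linked in the Intended use section instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed transformers-compatible checkpoint of Mistral-NeMo-12B-Base.
model_id = "mistralai/Mistral-Nemo-Base-2407"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 12B parameters; bf16 roughly halves memory vs fp32
    device_map="auto",
)

# Base (completion) model: prompt with text to continue, not chat turns.
prompt = "The key advantages of a 128k-token context window are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```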