---
license: apache-2.0
tags:
- nvidia
---

## Mistral-NeMo-12B-Base

[![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)
[![Model size](https://img.shields.io/badge/Params-12B-green)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)

### Model Overview:

Mistral-NeMo-12B-Base is a Large Language Model (LLM) composed of 12B parameters, trained jointly by NVIDIA and Mistral AI. It significantly outperforms existing models of smaller or similar size.

**Key features**

- Released under the Apache 2.0 License
- Pre-trained and instruction-tuned versions
- Trained with a 128k context window
- Trained on a large proportion of multilingual and code data

### Intended use

Mistral-NeMo-12B-Base is a completion model intended for use with over 80 programming languages and designed for global, multilingual applications. It is fast, trained on function-calling data, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. A minimal completion sketch is provided at the end of this card.

It is compatible with [NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html). For best performance on a given task, users are encouraged to customize the model using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more) and Model Alignment (SFT, SteerLM, RLHF, and more) with [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner). Refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/nemotron/index.html) for examples.

**Model Developer:** [NVIDIA](https://www.nvidia.com/en-us/) and [Mistral AI](https://mistral.ai/)

**Model Dates:** Mistral-NeMo-12B-Base was trained between 2023 and July 2024.

### Model Architecture:

Mistral-NeMo-12B-Base is a transformer model with the following architecture choices (see the configuration sketch at the end of this card):

- Layers: 40
- Dim: 5,120
- Head dim: 128
- Hidden dim: 14,436
- Activation function: SwiGLU
- Number of heads: 32
- Number of kv-heads: 8 (GQA)
- Rotary embeddings (theta = 1M)
- Vocabulary size: 2^17 ≈ 128k

**Architecture Type:** Transformer Decoder (auto-regressive language model)

### Evaluation Results

**Main benchmarks**

- HellaSwag (0-shot): 83.5%
- Winogrande (0-shot): 76.8%
- OpenBookQA (0-shot): 60.6%
- CommonSenseQA (0-shot): 70.4%
- TruthfulQA (0-shot): 50.3%
- MMLU (5-shot): 68.0%
- TriviaQA (5-shot): 73.8%
- NaturalQuestions (5-shot): 31.2%

**Multilingual benchmarks**

Multilingual MMLU in a 5-shot setting:

- French: 62.3%
- German: 62.7%
- Spanish: 64.6%
- Italian: 61.3%
- Portuguese: 63.3%
- Russian: 59.2%
- Chinese: 59.0%
- Japanese: 59.0%
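
### Architecture configuration (sketch)

The hyperparameters listed under Model Architecture map onto a standard decoder-only configuration. The snippet below is a minimal sketch that restates those figures using the Hugging Face `transformers` `MistralConfig` class; it assumes a recent `transformers` release that accepts `head_dim`, and it is not the checkpoint's shipped configuration file.

```python
from transformers import MistralConfig

# Illustrative only: the values below are copied from the architecture list
# in this card, not read from the distributed checkpoint.
config = MistralConfig(
    num_hidden_layers=40,          # Layers
    hidden_size=5120,              # Dim
    head_dim=128,                  # Head dim (explicit, since 5120 / 32 != 128)
    intermediate_size=14436,       # Hidden (MLP) dim, as listed above
    hidden_act="silu",             # gate activation used by the SwiGLU MLP
    num_attention_heads=32,        # Number of heads
    num_key_value_heads=8,         # kv-heads (GQA)
    rope_theta=1_000_000.0,        # Rotary embeddings, theta = 1M
    vocab_size=2**17,              # ~128k vocabulary
    max_position_embeddings=128 * 1024,  # 128k context window
)
print(config)
```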
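
### Quickstart (sketch)

This card does not prescribe a loading recipe. The following is a minimal completion sketch using the Hugging Face `transformers` library; the repository id `mistralai/Mistral-Nemo-Base-2407` is assumed to be a transformers-compatible distribution of this model. For NeMo-format checkpoints and customization (PEFT, alignment), use the NeMo Framework tooling linked in the Intended use section instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed transformers-compatible checkpoint of Mistral-NeMo-12B-Base.
model_id = "mistralai/Mistral-Nemo-Base-2407"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 12B parameters; bf16 roughly halves memory vs fp32
    device_map="auto",
)

# Base (completion) model: prompt with text to continue, not chat turns.
prompt = "The key advantages of a 128k-token context window are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```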