
Mistral-NeMo-12B-Base


Model Overview:

Mistral-NeMo-12B-Base is a Large Language Model (LLM) composed of 12B parameters, trained jointly by NVIDIA and Mistral AI. It significantly outperforms existing models of smaller or similar size.

Key features

  • Released under the Apache 2.0 License
  • Pre-trained and instruction-tuned versions
  • Trained with a 128k context window
  • Trained on a large proportion of multilingual and code data

Intended use

Mistral-NeMo-12B-Base is a completion model intended for use in 80+ programming languages and designed for global, multilingual applications. It is fast, trained on function-calling data, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. It is compatible with the NVIDIA NeMo Framework. For best performance on a given task, users are encouraged to customize the model with the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more) and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner.
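As a rough illustration of parameter-efficient fine-tuning on the Transformers-format checkpoint (linked below), the sketch here uses the Hugging Face peft library rather than the NeMo Framework tooling named above; the LoRA rank, target modules, and dtype are illustrative assumptions, not a recommended recipe.

```python
# Hedged sketch: attaching LoRA adapters to the Transformers-format checkpoint
# with Hugging Face peft (an alternative to the NeMo Framework customization
# tools). Hyperparameters and target modules below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-Nemo-Base-2407"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                                 # rank of the LoRA update matrices (assumed)
    lora_alpha=32,                        # LoRA scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

From here the wrapped model can be trained with any standard causal-LM training loop; the base weights stay frozen and only the small adapter matrices are updated.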

Model Developer: NVIDIA and Mistral AI

Model Dates: Mistral-NeMo-12B-Base was trained between May 2024 and June 2024.

Transformers format: https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
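A minimal completion sketch against this checkpoint (the prompt and generation settings are illustrative; as a base model it completes text rather than following chat instructions):

```python
# Minimal completion sketch for the Transformers-format checkpoint.
# Prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The three primary colors are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```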

Model Architecture:

Mistral-NeMo-12B-Base is a transformer model, with the following architecture choices:

  • Layers: 40
  • Dim: 5,120
  • Head dim: 128
  • Hidden dim: 14,336
  • Activation Function: SwiGLU
  • Number of heads: 32
  • Number of kv-heads: 8 (GQA)
  • Rotary embeddings (theta = 1M)
  • Vocabulary size: 2**17 ~= 128k
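As a back-of-the-envelope check that these choices add up to roughly 12B parameters, the sketch below counts the major weight matrices; it assumes untied input/output embeddings and ignores normalization weights.

```python
# Rough parameter count from the hyperparameters listed above.
# Assumes untied input/output embeddings; normalization weights are ignored.
layers, dim, head_dim = 40, 5120, 128
hidden_dim, n_heads, n_kv_heads = 14336, 32, 8
vocab = 2**17  # ~128k tokens

# Attention: Q, K, V projections plus the output projection (GQA: fewer KV heads)
attn = dim * head_dim * (n_heads + 2 * n_kv_heads) + (n_heads * head_dim) * dim
# SwiGLU MLP uses three projection matrices (gate, up, down)
mlp = 3 * dim * hidden_dim
# Input embedding plus output head (assumed untied)
embeddings = 2 * vocab * dim

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # prints roughly 12.2B
```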

Architecture Type: Transformer Decoder (auto-regressive language model)

Dataset & Training

The training corpus for Mistral-NeMo-12B-Base consists of English and multilingual text, as well as code. Our sources cover a variety of document types, such as webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more.

Data Freshness: The pretraining data has a cutoff of April 2024.

Evaluation Results

Main Benchmarks

  • HellaSwag (0-shot): 83.5%
  • Winogrande (0-shot): 76.8%
  • OpenBookQA (0-shot): 60.6%
  • CommonSenseQA (0-shot): 70.4%
  • TruthfulQA (0-shot): 50.3%
  • MMLU (5-shot): 68.0%
  • TriviaQA (5-shot): 73.8%
  • NaturalQuestions (5-shot): 31.2%

Multilingual benchmarks

Multilingual MMLU in a 5-shot setting:

  • French: 62.3%
  • German: 62.7%
  • Spanish: 64.6%
  • Italian: 61.3%
  • Portuguese: 63.3%
  • Russian: 59.2%
  • Chinese: 59.0%
  • Japanese: 59.0%

Limitations

The model was trained on data originally crawled from the internet that contains toxic language, unsafe content, and societal biases. The model may therefore amplify those biases and return toxic responses, especially when prompted with toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.