Update README.md
README.md (CHANGED)
@@ -8,7 +8,7 @@ language:
 - en
 ---
 
-# Meta-Llama-3.1-
+# Meta-Llama-3.1-405B-Instruct-FP8-dynamic
 
 ## Model Overview
 - **Model Architecture:** Meta-Llama-3.1
@@ -24,13 +24,13 @@ language:
 - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
 - **Model Developers:** Neural Magic
 
-Quantized version of [Meta-Llama-3.1-
+Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
 It achieves an average score of 78.69 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 78.67.
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-
-This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated on a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
 [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
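To make the scheme described in that hunk concrete, here is a minimal PyTorch sketch of symmetric per-channel FP8 weight quantization. It illustrates the math the card describes, not LLM Compressor's actual implementation, and the function name is hypothetical:

```python
import torch

# FP8 E4M3 representable maximum (448.0); torch exposes it via finfo.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(weight: torch.Tensor):
    """Symmetric per-channel quantization of a [out_features, in_features] weight.

    One scale per output channel maps that row's largest magnitude onto the
    FP8 maximum, keeping the representation symmetric around zero.
    """
    scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
    scale = scale.clamp(min=1e-12)      # guard against all-zero rows
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale                 # dequantize: w_fp8.float() * scale

# Activations follow the same idea but compute one scale per token at
# runtime ("dynamic" quantization), so no activation calibration is stored.
```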
@@ -45,8 +45,8 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
 
-model_id = "neuralmagic/Meta-Llama-3.1-
-number_gpus =
+model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
+number_gpus = 8
 
 sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
 
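The hunk above touches only two lines of the deployment snippet. Pieced together with the surrounding context, a complete version would look roughly like this; the prompt contents and the chat-template call are assumptions based on the usual vLLM chat pattern, not shown in the diff:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render a chat prompt with the model's chat template (assumed message).
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tensor-parallelize the FP8 checkpoint across all 8 GPUs of the node.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```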
@@ -106,11 +106,11 @@ quant_stage:
     targets: ["Linear"]
 """
 
-model_stub = "meta-llama/Meta-Llama-3.1-
+model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=
+    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
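Since the hunk shows only the middle of the creation script, here is a sketch of how the surrounding code typically fits together around llm-compressor's `oneshot` entry point. The recipe body is reconstructed from the quantization scheme described above (FP8 per-channel weights, per-token dynamic activations) and, like the `output_dir` name, should be treated as an approximation rather than the exact script:

```python
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Recipe approximating the scheme described in the card: FP8 weights with
# per-channel scales, FP8 activations with per-token dynamic scales.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
model_name = model_stub.split("/")[-1]

# Spread the 405B checkpoint over 8 GPUs, offloading the rest to CPU RAM.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)

# Apply the recipe in one shot and save the compressed checkpoint.
oneshot(
    model=model,
    recipe=recipe,
    output_dir=f"{model_name}-FP8-dynamic",  # hypothetical directory name
    save_compressed=True,
)
```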
@@ -134,7 +134,7 @@ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",dtype=auto,tensor_parallel_size=8,gpu_memory_utilization=0.755,add_bos_token=True,max_model_len=4096 \
   --tasks openllm \
   --batch_size auto
 ```
@@ -146,9 +146,9 @@ lm_eval \
   <tr>
    <td><strong>Benchmark</strong>
    </td>
-   <td><strong>Meta-Llama-3.1-
+   <td><strong>Meta-Llama-3.1-405B-Instruct</strong>
    </td>
-   <td><strong>Meta-Llama-3.1-
+   <td><strong>Meta-Llama-3.1-405B-Instruct-FP8-dynamic (this model)</strong>
    </td>
    <td><strong>Recovery</strong>
    </td>
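For reference, the Recovery column in the full results table (truncated here to its header row) is presumably the quantized model's score expressed as a percentage of the unquantized model's score; on the OpenLLM v1 average quoted above that would be 78.69 / 78.67 ≈ 100.03%, i.e. effectively lossless.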