Lin-K76 committed on
Commit
f18b9c8
1 Parent(s): 0b0dbe1

Update README.md

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -8,7 +8,7 @@ language:
 - en
 ---
 
-# Meta-Llama-3.1-70B-Instruct-FP8-dynamic
+# Meta-Llama-3.1-405B-Instruct-FP8-dynamic
 
 ## Model Overview
 - **Model Architecture:** Meta-Llama-3.1
@@ -24,13 +24,13 @@ language:
 - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
 - **Model Developers:** Neural Magic
 
-Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
+Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
 It achieves an average score of 78.69 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 78.67.
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to FP8 data type, ready for inference with vLLM built from source.
-This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to FP8 data type, ready for inference with vLLM built from source.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated with a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
 [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
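To make the hunk above concrete: at 16 bits per parameter the roughly 405 billion weights occupy on the order of 810 GB, and at 8 bits roughly 405 GB, which is what lets the checkpoint fit on a single 8x80 GB H100 node. The following is a toy sketch, not the LLM Compressor implementation, of the scheme described there: symmetric scaling into the FP8 E4M3 range, one scale per output channel for weights and one scale per token for activations, with the per-token scales recomputed dynamically at inference time. Shapes and values are illustrative.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(weight: torch.Tensor) -> torch.Tensor:
    # One scale per output dimension (row of the linear weight matrix).
    return weight.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX

def per_token_activation_scales(acts: torch.Tensor) -> torch.Tensor:
    # One scale per token (row of the activation matrix), recomputed per batch.
    return acts.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX

w = torch.randn(4096, 4096)   # illustrative linear weight
x = torch.randn(8, 4096)      # illustrative activations for 8 tokens

w_scaled = (w / per_channel_weight_scales(w)).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
x_scaled = (x / per_token_activation_scales(x)).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
# A real kernel would store these in torch.float8_e4m3fn and fold the scales
# back in after the matmul; this only shows where the scales come from.
```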
@@ -45,8 +45,8 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
 
-model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic"
-number_gpus = 2
+model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
+number_gpus = 8
 
 sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
 
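The hunk above changes only the model id and GPU count in the README's vLLM deployment example. For readers skimming the diff, here is a hedged, self-contained sketch of how those fragments are typically wired together; the chat messages and the surrounding calls are illustrative and not part of the diff itself.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
number_gpus = 8  # tensor-parallel degree across one 8xH100 node

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative conversation; any chat-formatted messages work here.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a plain prompt string before generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```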
@@ -106,11 +106,11 @@ quant_stage:
         targets: ["Linear"]
 """
 
-model_stub = "meta-llama/Meta-Llama-3.1-70B-Instruct"
+model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=2, torch_dtype=torch.float16
+    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
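The hunk above updates the model stub and GPU count inside the README's LLM Compressor script, of which only fragments are visible here. As a hedged sketch of how those pieces fit together, the following uses the object-style QuantizationModifier with the FP8_DYNAMIC preset in place of the README's quant_stage YAML recipe (whose tail appears above); the dataset id, split, and oneshot arguments are assumptions about the llm-compressor version in use, not text from the diff.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
model_name = model_stub.split("/")[-1]

# Spread the 16-bit model across the node's 8 GPUs (plus CPU offload) for calibration.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# 512 UltraChat sequences, as stated in the model card (dataset id and split assumed).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(512))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# FP8 weights (per-channel, static scales) and activations (per-token, dynamic scales)
# on all Linear layers; leaving lm_head unquantized is a typical choice, assumed here.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
    output_dir=f"{model_name}-FP8-dynamic",
)
```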
@@ -134,7 +134,7 @@ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.8,add_bos_token=True,max_model_len=4096 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",dtype=auto,tensor_parallel_size=8,gpu_memory_utilization=0.755,add_bos_token=True,max_model_len=4096 \
   --tasks openllm \
   --batch_size auto
 ```
@@ -146,9 +146,9 @@ lm_eval \
 <tr>
  <td><strong>Benchmark</strong>
  </td>
- <td><strong>Meta-Llama-3.1-70B-Instruct </strong>
+ <td><strong>Meta-Llama-3.1-405B-Instruct </strong>
  </td>
- <td><strong>Meta-Llama-3.1-70B-Instruct-FP8-dynamic(this model)</strong>
+ <td><strong>Meta-Llama-3.1-405B-Instruct-FP8-dynamic(this model)</strong>
  </td>
  <td><strong>Recovery</strong>
  </td>
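The table whose header appears in this hunk reports per-benchmark scores plus a Recovery column. Reading Recovery as the quantized score divided by the unquantized baseline, the OpenLLM v1 averages quoted earlier in the card work out to roughly:

```python
# Recovery = quantized average / unquantized average, expressed as a percentage.
baseline = 78.67   # Meta-Llama-3.1-405B-Instruct average (quoted earlier in the README)
quantized = 78.69  # FP8-dynamic average (quoted earlier in the README)
print(f"Recovery: {100 * quantized / baseline:.2f}%")  # ~100.03%
```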
 