Update README.md
README.md (CHANGED)
@@ -8,7 +8,7 @@ language:
 - en
 ---
 
-# Meta-Llama-3.1-
+# Meta-Llama-3.1-405B-Instruct-FP8-dynamic
 
 ## Model Overview
 - **Model Architecture:** Meta-Llama-3.1
@@ -24,13 +24,13 @@ language:
 - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
 - **Model Developers:** Neural Magic
 
-Quantized version of [Meta-Llama-3.1-
+Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
 It achieves an average score of 78.69 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 78.67.
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-
-This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated on a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
 [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
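To make the scheme described in that hunk concrete, here is a minimal PyTorch sketch of symmetric per-channel FP8 weight quantization. It illustrates the math the card describes, not LLM Compressor's actual implementation, and the function name is hypothetical:

```python
import torch

# FP8 E4M3 representable maximum (448.0); torch exposes it via finfo.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(weight: torch.Tensor):
    """Symmetric per-channel quantization of a [out_features, in_features] weight.

    One scale per output channel maps that row's largest magnitude onto the
    FP8 maximum, keeping the representation symmetric around zero.
    """
    scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
    scale = scale.clamp(min=1e-12)      # guard against all-zero rows
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale                 # dequantize: w_fp8.float() * scale

# Activations follow the same idea but compute one scale per token at
# runtime ("dynamic" quantization), so no activation calibration is stored.
```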
@@ -45,8 +45,8 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
 
-model_id = "neuralmagic/Meta-Llama-3.1-
-number_gpus =
+model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
+number_gpus = 8
 
 sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
 
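The hunk above touches only two lines of the deployment snippet. Pieced together with the surrounding context, a complete version would look roughly like this; the prompt contents and the chat-template call are assumptions based on the usual vLLM chat pattern, not shown in the diff:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render a chat prompt with the model's chat template (assumed message).
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tensor-parallelize the FP8 checkpoint across all 8 GPUs of the node.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```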
@@ -106,11 +106,11 @@ quant_stage:
     targets: ["Linear"]
 """
 
-model_stub = "meta-llama/Meta-Llama-3.1-
+model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=
+    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
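Since the hunk shows only the middle of the creation script, here is a sketch of how the surrounding code typically fits together around llm-compressor's `oneshot` entry point. The recipe body is reconstructed from the quantization scheme described above (FP8 per-channel weights, per-token dynamic activations) and, like the `output_dir` name, should be treated as an approximation rather than the exact script:

```python
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Recipe approximating the scheme described in the card: FP8 weights with
# per-channel scales, FP8 activations with per-token dynamic scales.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
model_name = model_stub.split("/")[-1]

# Spread the 405B checkpoint over 8 GPUs, offloading the rest to CPU RAM.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)

# Apply the recipe in one shot and save the compressed checkpoint.
oneshot(
    model=model,
    recipe=recipe,
    output_dir=f"{model_name}-FP8-dynamic",  # hypothetical directory name
    save_compressed=True,
)
```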
@@ -134,7 +134,7 @@ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",dtype=auto,tensor_parallel_size=8,gpu_memory_utilization=0.755,add_bos_token=True,max_model_len=4096 \
   --tasks openllm \
   --batch_size auto
 ```
@@ -146,9 +146,9 @@ lm_eval \
   <tr>
    <td><strong>Benchmark</strong>
    </td>
-   <td><strong>Meta-Llama-3.1-
+   <td><strong>Meta-Llama-3.1-405B-Instruct</strong>
    </td>
-   <td><strong>Meta-Llama-3.1-
+   <td><strong>Meta-Llama-3.1-405B-Instruct-FP8-dynamic (this model)</strong>
    </td>
    <td><strong>Recovery</strong>
    </td>
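For reference, the Recovery column in the full results table (truncated here to its header row) is presumably the quantized model's score expressed as a percentage of the unquantized model's score; on the OpenLLM v1 average quoted above that would be 78.69 / 78.67 ≈ 100.03%, i.e. effectively lossless.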