Update README
README.md
CHANGED
@@ -42,7 +42,7 @@ This model was obtained by quantizing the weights and activations of [Meta-Llama
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated with a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
 
-[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization
+[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
 
 ## Deployment
 
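For context on the change above: the quantization this README describes can be reproduced with LLM Compressor's one-shot PTQ flow. The sketch below is illustrative, not the exact script behind this checkpoint; the model ID and output directory are placeholders (the diff header truncates the base model name), and the import paths assume a recent llmcompressor release. The `FP8_DYNAMIC` scheme matches the text in the diff: symmetric per-channel FP8 scales for linear weights and per-token dynamic scales for activations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder: the hunk header only shows "Meta-Llama...", so the exact
# base model ID is not recoverable from this diff.
MODEL_ID = "meta-llama/Meta-Llama-<variant>"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: symmetric per-channel (per output dimension) weight scales
# plus per-token dynamic activation scales, i.e. the scheme the README
# describes. lm_head is excluded because only the linear operators inside
# the transformer blocks are quantized.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# One-shot PTQ; dynamic activation scales need no calibration dataset.
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Meta-Llama-FP8-dynamic"  # placeholder output path
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved checkpoint stores FP8 weights in compressed-tensors format, which vLLM can load directly, e.g. `LLM(model=SAVE_DIR, tensor_parallel_size=8)` on a single 8xH100 node, consistent with the memory claim in the diff.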