ekurtic committed on
Commit f4dbba5
1 Parent(s): 81758d1

Update README

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -42,7 +42,7 @@ This model was obtained by quantizing the weights and activations of [Meta-Llama
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated with a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
-[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
+[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
 
 ## Deployment
 
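For context, a minimal sketch of what such an FP8-dynamic quantization run with LLM Compressor might look like is shown below. The model ID, ignore list, and output directory are illustrative assumptions, not taken from this commit (the base model name is truncated in the diff above), and the `oneshot` import path has moved between llm-compressor releases (older versions expose it as `llmcompressor.transformers.oneshot`):

```python
# Minimal sketch of FP8-dynamic quantization with LLM Compressor.
# Assumptions: MODEL_ID is a stand-in Meta-Llama checkpoint, not the model in this repo;
# the ignore list and save directory are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # stand-in base model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights (symmetric, per-channel) and FP8 activations (dynamic, per-token);
# only the Linear operators are targeted and lm_head is left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8_DYNAMIC computes activation scales at runtime, so no calibration dataset is passed,
# which is consistent with the README change in this commit.
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```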