W8A16
I see you've uploaded a few models tagged as W8A16, but these are actually dynamic FP8 W8A8 afaict.
They will run on Ampere as W8A16 because FP8 activations aren't supported there, but on Ada Lovelace and above this will run as W8A8 dynamic unless you change the config.
@qeternity This model does in fact work on Ampere, as explained in the README.
Instead of using FP8 tensor cores, it uses FP16 tensor cores, relying on bit arithmetic and SIMT to dequantize rapidly. The kernel was profiled extensively in Nsight and crafted to be very efficient, so this is as close as we can get to using FP8 on Ampere. Why not quantize the activations as well? Because that would mean dequantizing everything on the fly, including the generated cache, which is far too compute intensive to be worth it.
Key Features of FP8 Marlin
The NeuralMagic FP8 Marlin kernel achieves impressive efficiency by packing 4 8-bit values into an int32 and performing a 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach yields nearly a 2x speedup over FP16 on most models while maintaining near lossless quality.
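To make the bit trick concrete, here is a tiny NumPy sketch of the idea, not the actual CUDA kernel: it assumes the E4M3 FP8 format and skips NaN handling.

```python
import numpy as np

def unpack_fp8x4_to_fp16(packed: np.ndarray) -> np.ndarray:
    """Toy illustration of 4xFP8 (E4M3) -> 4xFP16 dequantization via bit arithmetic.

    Each int32 holds four FP8 values (one per byte). This mirrors the idea behind
    the Marlin kernel, not its actual implementation (NaN handling omitted).
    """
    packed = packed.astype(np.uint32)
    # Pull out the four bytes of every int32: one E4M3 value each.
    fp8 = np.stack([(packed >> (8 * i)) & 0xFF for i in range(4)], axis=-1).astype(np.uint16)

    sign = (fp8 & 0x80) << 8      # sign bit -> bit 15 of the FP16 pattern
    exp_man = (fp8 & 0x7F) << 7   # 4-bit exponent + 3-bit mantissa -> FP16 exponent/mantissa fields
    halves = (sign | exp_man).view(np.float16)

    # The exponent is still biased for E4M3 (bias 7) rather than FP16 (bias 15),
    # so a single multiply by 2**(15 - 7) fixes the scale for normals and subnormals alike.
    return halves * np.float16(256.0)
```

For example, np.array([0x38383838], dtype=np.uint32), i.e. four copies of the E4M3 bit pattern for 1.0, comes back as four FP16 ones.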
FP8 Advantages on NVIDIA GPUs
On newer NVIDIA GPUs (4090/H100 or later), dedicated FP8 tensor cores and hardware allow fast conversion from FP8 to BF16/FP16, maximizing performance. Older GPUs lack this hardware support, so FP8 activation quantization isn't an option there. The Marlin kernel addresses this gap effectively, enabling FP8 performance gains on Ampere cards (e.g., 3090, A100) without native FP8 tensor core support.
Traditional int8 quantization methods often require extensive overhead for data type conversion between int8 and fp16, making them less efficient for inference. Marlin’s FP8 kernel bypasses this limitation by staying predominantly in FP16, removing the need for such conversions during runtime.
As far as I am aware, this will only work in vLLM, and I use it all the time on my 3090.
To run it, use this command:
```
python3 -m vllm.entrypoints.openai.api_server \
  --model Vezora/QwQ-32B-Preview-fp8-W8A16 \
  --dtype auto \
  --api-key token-abc123 \
  --quantization compressed-tensors \
  --max-num-batched-tokens 16384 \
  --max-model-len 16384 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.99
```
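Once the server is up, it can be queried like any OpenAI-compatible endpoint. A minimal example from Python (this assumes vLLM's default port 8000; the prompt is just a placeholder):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; 8000 is its default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="Vezora/QwQ-32B-Preview-fp8-W8A16",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```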
Yeah @qeternity, you can look at the script; I followed NeuralMagic's instructions for quantizing a model to W8A16 (see the sketch below). If you look at the config, the lm_head is left in FP16.
You can also run this on Ada cards and have the cache in FP8, but that is quantized on the fly before loading the model, which isn't a big deal since FP8 doesn't require calibration.
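For reference, a rough sketch of the llm-compressor FP8-dynamic recipe referred to above. The base model ID and output directory here are assumptions, and exact import paths and arguments vary between llm-compressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Qwen/QwQ-32B-Preview"          # assumed base model
SAVE_DIR = "QwQ-32B-Preview-fp8-W8A16"     # example output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic scheme: FP8 weights plus dynamic per-token FP8 activation scales,
# so no calibration data is needed; lm_head is left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```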
I'm not talking about the cache. I'm talking about the activations.
This config will run with the activations in FP8 on hardware that supports it (Ada Lovelace/Hopper and newer). It is actually FP8 W8A8, as you can see here: https://huggingface.co/Vezora/QwQ-32B-Preview-fp8-W8A16/blob/main/config.json#L16
It will implicitly run as W8A16 only on hardware where Marlin FP8 supports nothing more than weight-only quantization (Ampere).
It's a minor issue, but this isn't W8A16. It's just straight-up FP8.
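For anyone who wants to check this themselves, the quantization config can be inspected directly. A small sketch, assuming the usual compressed-tensors layout with config_groups / input_activations keys:

```python
import json
from huggingface_hub import hf_hub_download

# Download just the config of the published model and look at the quant settings.
cfg_path = hf_hub_download("Vezora/QwQ-32B-Preview-fp8-W8A16", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

group = qcfg["config_groups"]["group_0"]
print(group["weights"])            # 8-bit float weights -> the "W8" part
print(group["input_activations"])  # 8-bit float, dynamic -> activations are FP8 too (W8A8)
print(qcfg.get("ignore"))          # typically ["lm_head"]
```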