That's a fair point, but when the final goal is quantization, that 0.03% is negligible compared to the losses quantization itself introduces.
If you're talking about running at full precision, then yeah, bf16 > fp16 by all means.
I'd also prefer to see the KL divergence (KLD) between fp16 and bf16 outputs, since PPL is, to me, pretty meaningless. I'm sure it has value, probably more than I give it credit for, but unless it's PPL measured against the dataset the model was trained on, I don't find much merit in it.
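For what it's worth, the KLD comparison is cheap to do yourself once you have logits from both runs. A minimal sketch (the `logits_bf16` / `logits_fp16` names are hypothetical stand-ins for `[tokens, vocab]` tensors you'd collect from two forward passes over the same prompts):

```python
import torch
import torch.nn.functional as F

def mean_kld(logits_ref: torch.Tensor, logits_test: torch.Tensor) -> float:
    """Mean per-token KL(P || Q) between two runs of the same model."""
    # Do the metric math in fp32 so the measurement itself isn't noisy
    log_p = F.log_softmax(logits_ref.float(), dim=-1)   # reference run (e.g. bf16)
    log_q = F.log_softmax(logits_test.float(), dim=-1)  # candidate run (e.g. fp16)
    kld = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # KL per token position
    return kld.mean().item()

if __name__ == "__main__":
    # Toy demo: random logits standing in for real model outputs,
    # with small additive noise simulating rounding differences
    ref = torch.randn(128, 32000)
    noisy = ref + 1e-3 * torch.randn_like(ref)
    print(f"mean KLD: {mean_kld(ref, noisy):.3e}")
```

Unlike PPL, this directly measures how much the two dtypes disagree about the next-token distribution, with no dependence on what dataset you score against.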
I appreciate the breakdown though, and even 0.4% isn't enough to worry me when, again, the final goal is quantization, not running the model at that dtype.
To that end, do you happen to know whether, when quantizing from BF16, the weights get converted to FP16 first? Does it even matter? BF16 -> Q8 vs. BF16 -> FP16 -> Q8: I wonder how different the results would be. Gut instinct says the difference is in the 0.01% range.
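One way to get a feel for it is to test the dtype hop in isolation (a minimal sketch; this only models the BF16 -> FP16 conversion step, not any particular quantizer's actual pipeline). Since bf16 carries 8 mantissa bits vs. fp16's 10, any bf16 value inside fp16's dynamic range should convert exactly; loss can only appear at the edges (|x| > 65504 overflows, values below ~6e-5 can drop low bits in fp16's subnormal range):

```python
import torch

# Random values at a typical weight scale, standing in for real weights
x = torch.randn(1_000_000, dtype=torch.bfloat16)

direct = x.float()                       # stand-in for the BF16 -> Q8 input
via_fp16 = x.to(torch.float16).float()   # stand-in for BF16 -> FP16 -> Q8

mismatch = (direct != via_fp16).float().mean().item()
print(f"fraction of values changed by the fp16 hop: {mismatch:.6%}")
# For weights at roughly unit scale this prints zero or vanishingly close
# to it, since almost nothing falls outside fp16's representable range.
```

If that fraction is ~0 for the actual weight tensors, the two quantization paths should be effectively identical, which would back up the 0.01%-range gut instinct.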