FP8 conversion performance makes it slower than float16

My kernel computes (batched) dot products of vectors. I cast the inputs to float32 and do the multiply-accumulate in float32.
Surprisingly, on H100 I get somewhat worse performance for the fp8_e4m3 input type than for float16/bfloat16.
Is it possible that the conversion functions to/from fp8 are that slow?!
I would expect that, with half the IO, the fp8 version would be roughly 2x faster.
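
Roughly, the inner loop looks like this (simplified sketch with placeholder names; the real kernel is tiled and vectorized):

```
// Simplified sketch of the accumulation pattern (placeholder names and layout).
// Assumes CUDA 11.8+ for cuda_fp8.h, built for sm_90.
#include <cuda_fp8.h>

// batch independent dot products of length n: out[j] = dot(a[j], b[j])
__global__ void batched_dot_e4m3(const __nv_fp8_e4m3* a,
                                 const __nv_fp8_e4m3* b,
                                 float* out, int n, int batch)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per dot product
    if (j >= batch) return;
    const __nv_fp8_e4m3* va = a + (size_t)j * n;
    const __nv_fp8_e4m3* vb = b + (size_t)j * n;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        // Convert each e4m3 operand to fp32, then multiply-accumulate in fp32.
        acc = fmaf(static_cast<float>(va[i]), static_cast<float>(vb[i]), acc);
    }
    out[j] = acc;
}
```

The float16/bfloat16 variant is identical except for the input type and the conversion (e.g. __half2float).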

Also consider using the mma (Tensor Cores).

On the H100 you can get 4 independent FP8 dot products per SM per cycle, each of two vectors with 16 elements per vector, including the input conversion (and you probably have more inputs than outputs).

You would get even more performance if some of the vectors are reused.

Perhaps you can combine those with the conventional compute engines.
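
If you go that route, a single FP8 Tensor Core tile is exposed in PTX roughly as below (sketch only; it assumes sm_89 or newer and a CUDA toolkit whose PTX ISA includes FP8 mma, and it omits the per-thread fragment packing that is normally done with ldmatrix):

```
// Sketch of one FP8 tile multiply-accumulate via PTX mma.sync
// (shape m16n8k32, e4m3 inputs, fp32 accumulation).
// The caller must already hold the A/B fragments packed into registers
// in the layout the PTX ISA defines for m16n8k32; that part is omitted here.
__device__ void mma_m16n8k32_e4m3(float d[4],
                                  const unsigned a[4],  // 16 packed e4m3 values per thread
                                  const unsigned b[2],  //  8 packed e4m3 values per thread
                                  const float c[4])     // fp32 accumulator fragment
{
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```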

For small tensors, tensor cores are slower because of wave quantization and poor utilization. A good old dot product is actually faster. For 16-bit inputs it works like magic, but the e4m3 type is slow.

Using mma directly, instead of first creating the matrices in global memory, should work better.

About the conversions: the CUDA Programming Guide states 16 conversions per SM per cycle for the H100, regardless of FP8 or FP16/BF16. What speeds do you get for each type?
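
If you want to pin that down in isolation, a rough microbenchmark along these lines would do (placeholder names, untuned; time it with cudaEvents or ncu):

```
// Rough sketch of a conversion-throughput microbenchmark (placeholder names).
// Each thread keeps its value in registers, so memory traffic stays out of the
// measurement. The dependency chain prevents the loop from being optimized away,
// but it also means this measures back-to-back conversion latency as much as
// throughput; use several independent accumulators per thread for pure throughput.
#include <cuda_fp8.h>
#include <cuda_fp16.h>

__global__ void convert_e4m3(const __nv_fp8_e4m3* in, float* out, int iters)
{
    __nv_fp8_e4m3 x = in[threadIdx.x];         // one load per thread
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += static_cast<float>(x);          // fp8 -> fp32 conversion under test
        x = __nv_fp8_e4m3(acc * 0.5f);         // fp32 -> fp8 conversion under test
    }
    out[threadIdx.x] = acc;                    // keep the result live
}

__global__ void convert_fp16(const __half* in, float* out, int iters)
{
    __half x = in[threadIdx.x];
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += __half2float(x);                // fp16 -> fp32 conversion under test
        x = __float2half(acc * 0.5f);          // fp32 -> fp16 conversion under test
    }
    out[threadIdx.x] = acc;
}
```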

Common knowledge says mma works better, but this is not true for small sizes.
On H100 I get 7.5 us for fp16, which is better than the 11 us of native PyTorch for the same op.
For fp8 I get 11.7 us. So strange…

How many dot products, and with how many elements each, do you execute in that time?

The smaller the data size, the faster mma should run; at the least, it should not get slower.

I always like to say: “There has to be a rational explanation, I just haven’t found it yet.”

Have you performed a detailed comparison of the profiler output for each variant? Any significant difference in a metric (or metrics) would likely point you at whatever causes the unexpected performance difference.

Another possible angle of attack would be a review of the generated machine code for the two cases. Without some previously acquired proficiency in analyzing SASS that is likely a less fruitful path. But if you have done such code review before, something might jump out at you.

I always like to say that behind anything that appears to be rational there is strangeness waiting to burst out. Quantum mechanics agrees with me.

I will try to use ncu to figure it out once I fix my cluster problems (no permissions).
Thanks for the help!