FP8 conversion performance makes it slower than float16

My kernel computes (batched) dot products of vectors. I cast the inputs to float32 and do the multiply-accumulate in float32.
Surprisingly, on H100 I get somewhat worse performance for the fp8_e4m3 input type than for float16/bfloat16.
Is it possible that the conversion functions to/from fp8 are that slow?!
I would expect that, with half the IO, the fp8 version would be roughly 2x faster.
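
Roughly, the inner loop looks like this (simplified sketch with placeholder names; the real kernel is tiled and vectorized):

```
// Simplified sketch of the accumulation pattern (placeholder names and layout).
// Assumes CUDA 11.8+ for cuda_fp8.h, built for sm_90.
#include <cuda_fp8.h>

// batch independent dot products of length n: out[j] = dot(a[j], b[j])
__global__ void batched_dot_e4m3(const __nv_fp8_e4m3* a,
                                 const __nv_fp8_e4m3* b,
                                 float* out, int n, int batch)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per dot product
    if (j >= batch) return;
    const __nv_fp8_e4m3* va = a + (size_t)j * n;
    const __nv_fp8_e4m3* vb = b + (size_t)j * n;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        // Convert each e4m3 operand to fp32, then multiply-accumulate in fp32.
        acc = fmaf(static_cast<float>(va[i]), static_cast<float>(vb[i]), acc);
    }
    out[j] = acc;
}
```

The float16/bfloat16 variant is identical except for the input type and the conversion (e.g. __half2float).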

Also consider using the mma (Tensor Cores).

On the H100 you can get 4 independent FP8 dot products per SM per cycle, each of two vectors with 16 elements per vector, including the input conversion (and you probably have more inputs than outputs).

You would get even more performance if some of the vectors are reused.

Perhaps you can combine those with the conventional compute engines.
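
If you go that route, a single FP8 Tensor Core tile is exposed in PTX roughly as below (sketch only; it assumes sm_89 or newer and a CUDA toolkit whose PTX ISA includes FP8 mma, and it omits the per-thread fragment packing that is normally done with ldmatrix):

```
// Sketch of one FP8 tile multiply-accumulate via PTX mma.sync
// (shape m16n8k32, e4m3 inputs, fp32 accumulation).
// The caller must already hold the A/B fragments packed into registers
// in the layout the PTX ISA defines for m16n8k32; that part is omitted here.
__device__ void mma_m16n8k32_e4m3(float d[4],
                                  const unsigned a[4],  // 16 packed e4m3 values per thread
                                  const unsigned b[2],  //  8 packed e4m3 values per thread
                                  const float c[4])     // fp32 accumulator fragment
{
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```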

For small tensors, tensor cores are slower because of wave quantization and poor utilization. A good old dot product is actually faster. For 16-bit inputs it works like magic, but the e4m3 type is slow.

Using mma directly, instead of first creating the matrices in global memory, should work better.

About the conversions: the CUDA Programming Guide states 16 conversions per SM per cycle for the H100, regardless of FP8 or FP16/BF16. What speeds do you get for each type?
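
If you want to pin that down in isolation, a rough microbenchmark along these lines would do (placeholder names, untuned; time it with cudaEvents or ncu):

```
// Rough sketch of a conversion-throughput microbenchmark (placeholder names).
// Each thread keeps its value in registers, so memory traffic stays out of the
// measurement. The dependency chain prevents the loop from being optimized away,
// but it also means this measures back-to-back conversion latency as much as
// throughput; use several independent accumulators per thread for pure throughput.
#include <cuda_fp8.h>
#include <cuda_fp16.h>

__global__ void convert_e4m3(const __nv_fp8_e4m3* in, float* out, int iters)
{
    __nv_fp8_e4m3 x = in[threadIdx.x];         // one load per thread
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += static_cast<float>(x);          // fp8 -> fp32 conversion under test
        x = __nv_fp8_e4m3(acc * 0.5f);         // fp32 -> fp8 conversion under test
    }
    out[threadIdx.x] = acc;                    // keep the result live
}

__global__ void convert_fp16(const __half* in, float* out, int iters)
{
    __half x = in[threadIdx.x];
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += __half2float(x);                // fp16 -> fp32 conversion under test
        x = __float2half(acc * 0.5f);          // fp32 -> fp16 conversion under test
    }
    out[threadIdx.x] = acc;
}
```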

Common knowledge says mma works better, but this is not true for small sizes.
On H100 I get 7.5 us for fp16, which is better than the 11 us of native PyTorch for the same op.
For fp8 I get 11.7 us. So strange…

How many dot products, and with how many elements each, do you execute in that time?

The smaller the data size, the faster mma should run; at the least, it should not get slower.

I always like to say: “There has to be a rational explanation, I just haven’t found it yet.”

Have you performed a detailed comparison of the profiler output for each variant? Any significant difference in a metric (or metrics) would likely point you at whatever causes the unexpected performance difference.

Another possible angle of attack would be a review of the generated machine code for the two cases. Without some previously acquired proficiency in analyzing SASS that is likely a less fruitful path. But if you have done such code review before, something might jump out at you.

I always like to say that behind anything that appears to be rational there is strangeness waiting to burst out. Quantum mechanics agrees with me.

I will try to use ncu to figure it out once I fix my cluster problems (no permissions).
Thanks for the help!