Difference in SM performance of float16 and bfloat16

The CUDA C++ Programming Guide (nvidia.com) states that on Compute Capability 8.0 and 8.6, the throughput of the “16-bit floating-point add, multiply, multiply-add” arithmetic instructions differs between fp16 (256 results per clock cycle per SM) and bfloat16 (128 results).

If they are using the same fp16 pipe, what is the reason behind the performance difference?

Is there any evidence that IEEE-754 FP16 and BFLOAT16 arithmetic are handled by the same execution pipes in these GPUs?

Since the BFLOAT16 format is essentially a truncated version of the IEEE-754 FP32 format, BFLOAT16 operations could be implemented via the FP32 execution pipes with only minor hardware modifications, especially if it was a late addition to the GPU microarchitecture.
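To illustrate the “truncated FP32” relationship, here is a minimal host-side sketch (my own helper names, and simple round-toward-zero truncation rather than the round-to-nearest-even that real hardware and libraries typically use):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Truncate an IEEE-754 FP32 value to a bfloat16 bit pattern by keeping
// the upper 16 bits (sign, 8 exponent bits, top 7 mantissa bits).
static uint16_t float_to_bf16_truncate(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

// Expand a bfloat16 bit pattern back to FP32 by zero-filling the low bits.
static float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main() {
    float x = 3.14159265f;
    uint16_t h = float_to_bf16_truncate(x);
    printf("%f -> 0x%04x -> %f\n", x, h, bf16_to_float(h));  // ~3.140625
    return 0;
}
```

Because the exponent field is identical to FP32’s, an FP32 datapath can in principle consume these operands with little extra logic, which is what makes the shared-pipe hypothesis plausible.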

NCU shows the same FP16 pipe being utilized for a sequence of FP16 ops and a sequence of BF16 ops.

We were having doubts because we measured the same IPC for both cases, which conflicts with the throughput numbers in the Programming Guide mentioned above.
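For reference, a minimal sketch of the kind of kernels one might profile under NCU for such a comparison (kernel names and the iteration count are illustrative, not our exact benchmark; the packed BF16 intrinsics require compiling for sm_80 or later):

```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Each thread issues a long dependent chain of packed FMAs so the
// arithmetic pipe, not memory, is the bottleneck. ITERS is arbitrary.
constexpr int ITERS = 4096;

__global__ void fp16_fma_chain(__half2 *out, __half2 a, __half2 b) {
    __half2 acc = __float2half2_rn(1.0f);
    for (int i = 0; i < ITERS; ++i)
        acc = __hfma2(a, acc, b);      // 2 FP16 FMAs per instruction
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}

__global__ void bf16_fma_chain(__nv_bfloat162 *out, __nv_bfloat162 a,
                               __nv_bfloat162 b) {
    __nv_bfloat162 acc = __float2bfloat162_rn(1.0f);
    for (int i = 0; i < ITERS; ++i)
        acc = __hfma2(a, acc, b);      // 2 BF16 FMAs per instruction
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}
```

Profiling each kernel separately (e.g. with `ncu --set full`) and comparing the pipe-utilization and issued-instruction metrics is how we arrived at the observation above.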

FP16 (not BF16) non-tensor-core throughput on Ampere is especially high; see here. I suspect that your IPC measurement (being comparable) means you are getting approximately 1/2 of the peak theoretical FP16 throughput. I can’t get into a detailed explanation for several reasons: I don’t know for sure what throughput you are getting, and I don’t have a recipe to unlock the additional 2x factor for FP16 on Ampere, if you are not already getting it.

Although I worked on floating-point units during a part of my professional career, I am not an expert on the largely undisclosed details of NVIDIA’s GPU microarchitecture.

IPC refers to instructions per cycle, but since the FP16 pipeline performs (to the best of my knowledge) 2-way SIMD operations, the number of results per cycle would be twice that.
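To make the arithmetic concrete (a back-of-the-envelope sketch using the guide’s numbers, not a confirmed microarchitectural description):

```cpp
#include <cuda_fp16.h>

// If FP16 FMAs issue as 128 packed HFMA2 instructions per clock per SM,
// that is 128 * 2 = 256 FP16 results/clock/SM (the guide's fp16 number).
// The same instruction rate producing one result per instruction would
// give 128 results/clock/SM (the guide's bfloat16 number): equal IPC,
// half the results. This pairing is an assumption, not a documented fact.
__device__ __half2 packed_fma(__half2 a, __half2 x, __half2 b) {
    // One instruction, two FP16 FMA results: {a.x*x.x+b.x, a.y*x.y+b.y}
    return __hfma2(a, x, b);
}
```

Under that assumption, measuring equal IPC for both sequences is entirely consistent with a 2x difference in results per clock.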

You could file a bug report with NVIDIA pointing out the discrepancy between your measurements and the programming guide.