Difference in SM performance of float16 and bfloat16

The CUDA C++ Programming Guide (nvidia.com) states that on Compute Capability 8.0 and 8.6, the throughput of the “16-bit floating-point add, multiply, multiply-add” arithmetic instructions differs between fp16 (256 results per clock cycle per SM) and bfloat16 (128 results).

If they are using the same fp16 pipe, what is the reason behind the performance difference?

Is there any evidence that IEEE-754 FP16 and BFLOAT16 arithmetic are handled by the same execution pipes in these GPUs?

Since the BFLOAT16 format is essentially a truncated version of the IEEE-754 FP32 format, BFLOAT16 operations could be implemented via the FP32 execution pipes with only minor hardware modifications, especially if it was a late addition to the GPU microarchitecture.
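To illustrate the “truncated FP32” relationship, here is a minimal host-side sketch (my own helper names, and simple round-toward-zero truncation rather than the round-to-nearest-even that real hardware and libraries typically use):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Truncate an IEEE-754 FP32 value to a bfloat16 bit pattern by keeping
// the upper 16 bits (sign, 8 exponent bits, top 7 mantissa bits).
static uint16_t float_to_bf16_truncate(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

// Expand a bfloat16 bit pattern back to FP32 by zero-filling the low bits.
static float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main() {
    float x = 3.14159265f;
    uint16_t h = float_to_bf16_truncate(x);
    printf("%f -> 0x%04x -> %f\n", x, h, bf16_to_float(h));  // ~3.140625
    return 0;
}
```

Because the exponent field is identical to FP32’s, an FP32 datapath can in principle consume these operands with little extra logic, which is what makes the shared-pipe hypothesis plausible.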

NCU shows the same FP16 pipe being utilized for a sequence of FP16 ops and a sequence of BF16 ops.

We were having doubts because we measured the same IPC for both cases, which conflicts with the throughput numbers in the Programming Guide mentioned above.
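For reference, a minimal sketch of the kind of kernels one might profile under NCU for such a comparison (kernel names and the iteration count are illustrative, not our exact benchmark; the packed BF16 intrinsics require compiling for sm_80 or later):

```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Each thread issues a long dependent chain of packed FMAs so the
// arithmetic pipe, not memory, is the bottleneck. ITERS is arbitrary.
constexpr int ITERS = 4096;

__global__ void fp16_fma_chain(__half2 *out, __half2 a, __half2 b) {
    __half2 acc = __float2half2_rn(1.0f);
    for (int i = 0; i < ITERS; ++i)
        acc = __hfma2(a, acc, b);      // 2 FP16 FMAs per instruction
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}

__global__ void bf16_fma_chain(__nv_bfloat162 *out, __nv_bfloat162 a,
                               __nv_bfloat162 b) {
    __nv_bfloat162 acc = __float2bfloat162_rn(1.0f);
    for (int i = 0; i < ITERS; ++i)
        acc = __hfma2(a, acc, b);      // 2 BF16 FMAs per instruction
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}
```

Profiling each kernel separately (e.g. with `ncu --set full`) and comparing the pipe-utilization and issued-instruction metrics is how we arrived at the observation above.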

FP16 (not BF16) non-tensor-core throughput on Ampere is especially high; see here. I suspect that your IPC measurement (being comparable) means you are getting approximately 1/2 of the peak theoretical FP16 throughput. I can’t get into a detailed explanation for several reasons: I don’t know for sure what throughput you are getting, and I don’t have a recipe to unlock the additional 2x factor for FP16 on Ampere, if you are not already getting it.

Although I worked on floating-point units during a part of my professional career, I am not an expert on the largely undisclosed details of NVIDIA’s GPU microarchitecture.

IPC refers to instructions per cycle, but since the FP16 pipeline performs (to the best of my knowledge) 2-way SIMD operations, the number of results per cycle would be twice that.
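To make the arithmetic concrete (a back-of-the-envelope sketch using the guide’s numbers, not a confirmed microarchitectural description):

```cpp
#include <cuda_fp16.h>

// If FP16 FMAs issue as 128 packed HFMA2 instructions per clock per SM,
// that is 128 * 2 = 256 FP16 results/clock/SM (the guide's fp16 number).
// The same instruction rate producing one result per instruction would
// give 128 results/clock/SM (the guide's bfloat16 number): equal IPC,
// half the results. This pairing is an assumption, not a documented fact.
__device__ __half2 packed_fma(__half2 a, __half2 x, __half2 b) {
    // One instruction, two FP16 FMA results: {a.x*x.x+b.x, a.y*x.y+b.y}
    return __hfma2(a, x, b);
}
```

Under that assumption, measuring equal IPC for both sequences is entirely consistent with a 2x difference in results per clock.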

You could file a bug report with NVIDIA pointing out the discrepancy between your measurements and the programming guide.