Difference in SM performance of float16 and bfloat16

Robert_Crovella · August 7, 2024, 9:44pm

FP16 (not BF16) non-tensor-core throughput is especially high on Ampere; see here. Since your IPC measurements for the two cases are comparable, I suspect you are getting approximately 1/2 of the peak theoretical FP16 throughput. I can't give a detailed explanation, for two reasons: I don't know for sure what throughput you are getting, and I don't have a recipe to unlock the additional 2x factor for FP16 on Ampere if you are not already getting it.
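To make the arithmetic behind that estimate concrete, here is a rough sketch. The per-SM peak rates are hypothetical placeholders chosen only to encode the 2x FP16-over-BF16 ratio described above, and the measured rate assumes the BF16 kernel runs at about its peak; none of these numbers are measurements or verified hardware figures.

```python
def fraction_of_peak(measured_per_clk: float, peak_per_clk: float) -> float:
    """Return measured throughput as a fraction of the theoretical peak."""
    if peak_per_clk <= 0:
        raise ValueError("peak_per_clk must be positive")
    return measured_per_clk / peak_per_clk


# Hypothetical per-SM peak rates (results per clock). Only the 2x ratio
# between them comes from the post; the absolute values are placeholders.
PEAK_BF16 = 128.0
PEAK_FP16 = 2 * PEAK_BF16

# Assumption: both kernels use the same instruction mix and show the same IPC,
# so they retire the same number of results per clock. We further assume the
# BF16 kernel is running at roughly its peak.
measured = PEAK_BF16

bf16_fraction = fraction_of_peak(measured, PEAK_BF16)
fp16_fraction = fraction_of_peak(measured, PEAK_FP16)

print(f"BF16 fraction of peak: {bf16_fraction:.2f}")
print(f"FP16 fraction of peak: {fp16_fraction:.2f}")
```

Whatever the real measured rate is, equal IPC for the two kernels means the FP16 fraction of peak is half the BF16 fraction under this assumed 2x ratio.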
