Difference in SM performance of float16 and bfloat16

On Ampere, non-tensor-core FP16 (but not BF16) arithmetic has especially high throughput; see here. Since your IPC measurements for the two types are comparable, I suspect that you are getting approximately 1/2 of the peak theoretical FP16 throughput. I can’t get into a detailed explanation for several reasons. I don’t know for sure what throughput you are getting, and I don’t have a recipe to unlock the additional 2x factor for FP16 on Ampere, if you are not already getting it.
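To make the 1/2 estimate concrete, here is a rough sketch of the arithmetic behind it. Every number below is hypothetical and chosen only for illustration: the peak rates, the work per instruction, and the measured IPC are assumptions, not values taken from your profile or from a datasheet. The only thing the sketch relies on is the premise above, that the FP16 peak is 2x the BF16 peak.

```python
def fraction_of_peak(ipc, ops_per_instruction, peak_ops_per_clk):
    """Return achieved throughput as a fraction of the peak rate."""
    achieved_ops_per_clk = ipc * ops_per_instruction
    return achieved_ops_per_clk / peak_ops_per_clk


# Hypothetical per-SM peak rates (results per clock). The 2x ratio is the
# assumption being illustrated; the absolute values are placeholders.
BF16_PEAK = 128.0
FP16_PEAK = 256.0

# Hypothetical: results produced by one warp-level instruction
# (32 lanes x 2 packed 16-bit values).
OPS_PER_INSTRUCTION = 64.0

# Hypothetical: the same measured IPC for both kernels.
MEASURED_IPC = 2.0

bf16_fraction = fraction_of_peak(MEASURED_IPC, OPS_PER_INSTRUCTION, BF16_PEAK)
fp16_fraction = fraction_of_peak(MEASURED_IPC, OPS_PER_INSTRUCTION, FP16_PEAK)

print(f"BF16 fraction of peak: {bf16_fraction:.2f}")
print(f"FP16 fraction of peak: {fp16_fraction:.2f}")
```

With equal IPC and equal work per instruction, the two kernels do the same amount of work per clock, so the type with the 2x higher peak ends up at half the fraction of its peak. To check this against a real run, substitute the IPC from your profiler and the peak rates for your specific GPU.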