FP16 (not BF16) non-tensor-core throughput on Ampere is especially high; see here. Your IPC measurement (being comparable) suggests that you are getting approximately 1/2 of the peak theoretical FP16 throughput. I can't give a detailed explanation, for several reasons: I don't know for sure what throughput you are getting, and I don't have a recipe to unlock the additional 2x factor for FP16 on Ampere, if you are not already getting it.
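For concreteness, here is a minimal sketch (my own, not from the original exchange; kernel name and launch parameters are arbitrary) of the kind of microbenchmark usually aimed at the packed FP16 path: each `__hfma2` issues two FP16 fused multiply-adds, which is the path the non-tensor-core peak-rate numbers are quoted for. Whether this actually reaches the full 2x on a given Ampere part is exactly the open question above.

```cuda
#include <cuda_fp16.h>
#include <cstdio>

__global__ void fp16x2_fma(__half2 *out, int iters)
{
    // values chosen so the accumulators neither overflow nor flush to zero
    __half2 a  = __float2half2_rn(0.999f);
    __half2 b  = __float2half2_rn(0.001f);
    __half2 c0 = __float2half2_rn(0.0f);
    __half2 c1 = __float2half2_rn(1.0f);
    for (int i = 0; i < iters; ++i) {
        // each __hfma2 performs two FP16 fused multiply-adds (c = a*c + b)
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
    }
    // write the result so the compiler cannot eliminate the loop
    out[blockIdx.x * blockDim.x + threadIdx.x] = __hadd2(c0, c1);
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    __half2 *d_out;
    cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(__half2));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    fp16x2_fma<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    // per thread per iteration: 2 HFMA2 x 2 FP16 FMAs x 2 flops = 8 FLOPs
    double flops = (double)blocks * threads * iters * 8.0;
    printf("%.1f GFLOP/s FP16 (non-tensor-core)\n", flops / ms / 1e6);

    cudaFree(d_out);
    return 0;
}
```

The two independent accumulator chains give the scheduler a little ILP so the measurement is throughput-bound rather than latency-bound; comparing the SASS (look for HFMA2 vs. HFMA) against the measured rate is one way to check whether the packed path is actually being used.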