Table 2 in section 5.4.1 of the programming guide says that the throughput for 16-bit floating-point add, multiply, and multiply-add for compute capability 6.1 is only 2. Is that an error? All of the other compute capabilities have twice the throughput for half precision as for single precision, so is that meant to be 256?
It’s not an error.
The design of the cc6.1 SM is different from the design of the cc6.0/cc6.2/cc7.x/cc5.3 SM in this respect.
The throughput of FP16 arithmetic on cc6.1 is relatively low. FP16 arithmetic is supported there for application compatibility. It is not a performance path on cc6.1.
Note that for parameter storage (as opposed to compute throughput), FP16 can still be a “win” on cc6.1 in cases where memory traffic drives application performance. FP16 has twice the storage density of FP32, so it may be beneficial to store data in (packed) FP16 format, convert on the fly to FP32 for calculations, and convert back to FP16 for storage. This assumes your app/algorithm can tolerate the numerical implications of parameter storage in FP16.
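To make the store-in-FP16, compute-in-FP32 idea concrete, here is a minimal sketch. It is only an illustration. The kernel name `axpy_fp16_storage`, the host helper `demo_max_error`, and the choice of an axpy operation are assumptions made for this example, not anything from the programming guide. The conversion intrinsics (`__half22float2`, `__float22half2_rn`, `__float2half`, `__half2float`) come from `cuda_fp16.h`, and the sketch assumes a toolkit recent enough that the scalar conversions are callable from host code.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// y = a * x + y, with x and y stored as packed FP16 pairs (__half2).
// All arithmetic is done in FP32, so no FP16 arithmetic instructions are used.
__global__ void axpy_fp16_storage(int n_pairs, float a,
                                  const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        float2 xf = __half22float2(x[i]);  // unpack two FP16 values to FP32
        float2 yf = __half22float2(y[i]);
        yf.x = fmaf(a, xf.x, yf.x);        // FP32 multiply-add
        yf.y = fmaf(a, xf.y, yf.y);
        y[i] = __float22half2_rn(yf);      // round to nearest and repack
    }
}

// Hypothetical host-side check: fills x with 1.0 and y with 2.0, runs the
// kernel with a = 0.5, and returns the largest deviation from 2.5.
// Returns -1.0f if any CUDA call fails.
float demo_max_error()
{
    const int n_pairs = 1 << 10;
    const int n = 2 * n_pairs;
    const float a = 0.5f;
    const size_t bytes = n * sizeof(__half);

    std::vector<__half> hx(n, __float2half(1.0f));
    std::vector<__half> hy(n, __float2half(2.0f));

    __half2 *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1.0f; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n_pairs + threads - 1) / threads;
    axpy_fp16_storage<<<blocks, threads>>>(n_pairs, a, dx, dy);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err =
        cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        max_err = fmaxf(max_err, fabsf(__half2float(hy[i]) - 2.5f));
    return max_err;
}
```

The inputs here were picked so that every value is exactly representable in FP16. With real data, each round trip through FP16 storage rounds the result, which is the numerical implication mentioned above. Whether this beats plain FP32 storage depends on how memory-bound the kernel is, so it is worth profiling both versions.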
Roger that. Thanks for the quick reply.