Relations between instruction throughput and CUDA compute capability


The CUDA programming guide provides a table for the throughput of native arithmetic instructions for devices of different compute capabilities.

However, the relationship between instruction throughput and compute capability is really confusing. For example, according to the table, CC 7.x & 8.0 devices compute 32 64-bit floating-point instructions per clock cycle per SM, while for CC 8.6 & 8.9 devices this number drops to 4.

Therefore, my question is: why can the instruction throughput on newer devices with a higher compute capability sometimes be significantly lower than on older devices? In other words, can I expect higher instruction throughput per SM per cycle on newer devices?


GPU architectures used for consumer GPUs are designed with low FP64 throughput, while GPU architectures used for HPC GPUs are designed with high FP64 throughput. This is part of NVIDIA's market differentiation approach. Compute capabilities 6.0, 7.0, 8.0, and 9.0 are or were used for HPC GPUs.

Generally speaking, no. One can find multiple examples where this assumption does not hold. It all boils down to architectural choices, which are constantly adjusted to evolving market requirements. For example, a newer architecture may de-emphasize a particular instruction class (for example, IIRC this has happened to MUFU instructions twice), or may opt to use smaller SMs, but deploy more of them.

Note that there is no need to make assumptions on the part of CUDA programmers, because NVIDIA documents the throughput of various operation classes for each GPU architecture in the Programming Guide, as you have found.
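To illustrate how the documented per-SM throughput translates into a device-level number, here is a minimal sketch. The SM count and clock below are assumptions chosen to resemble an A100-class (CC 8.0) part, not values taken from this thread; the `32` is the FP64 results-per-cycle-per-SM figure the question quotes for CC 8.0:

```python
# Sketch: turning the programming-guide table entry
# ("results per clock cycle per SM") into peak FLOP/s.
# Device parameters are illustrative assumptions (A100-like,
# CC 8.0): 108 SMs, ~1.41 GHz boost clock.

def peak_fp64_flops(num_sms, results_per_cycle_per_sm, clock_hz):
    """Peak FP64 FLOP/s; the factor of 2 counts an FMA as two FLOPs."""
    return num_sms * results_per_cycle_per_sm * 2 * clock_hz

a100_like = peak_fp64_flops(108, 32, 1.41e9)  # table lists 32 for CC 8.0
print(f"{a100_like / 1e12:.2f} TFLOP/s")      # ~9.7 TFLOP/s
```

Plugging in the much smaller per-SM FP64 figure for a consumer-class CC 8.6 part shows why its aggregate FP64 rate is a small fraction of the HPC part's, even with a comparable SM count.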


Thanks for your great answer!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.