Hi, I’m part of the team responsible for the instruction throughput table in the CUDA programming guide, and I wanted to give you an update on this thread from my point of view:
First, thank you all for raising these issues!
With CUDA 13.0, we will make significant changes to the table: (1) it will be moved to the CUDA best practices guide, as it is less relevant to the programming model, and (2) it will be restructured and will include example PTX instructions. This will hopefully provide a little more clarity, although it is obviously still far from perfect.
Concerning Blackwell integer instruction throughput specifically:
- As already pointed out in the thread, some instructions have not been improved or changed: this applies to IMAD, LOP3, and PRMT, for example, as well as IADD3.
- The main improved instructions relevant to this thread are IADD, IMNMX/VIMNMX, and FSETP/ISETP, i.e. addition of 2 operands, min/max, and compare.
- Integer addition specifically is even more complicated: previous architectures could already achieve 2x throughput by combining, for example, IADD3 with IMAD.IADD or VIADD (the latter for 9.0/10.0). Blackwell 12.0 now allows achieving this 2x throughput with a single instruction: IADD. Note, however, that it can be difficult to get an instruction sequence that actually achieves the higher throughput. For previous architectures, this is because of constraints in the instructions as well as in the compiler, which we cannot disclose publicly. For Blackwell 12.0, the compiler often outputs IADD3 instead of IADD; this should be improved soon.
Because of the above, it is unlikely that current benchmarks are able to achieve this 2x throughput, unless they are hand-crafted without relying on the compiler.
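If you want to check which add instructions the compiler actually emitted for your benchmark kernel, one option is to count opcodes in the SASS disassembly. The following is only a minimal sketch in Python, not an official tool: `count_opcodes` and `SAMPLE` are names made up for this post, the sample text is hand-written for illustration (it is not real compiler output), and the regular expression assumes the usual `/*address*/ [@predicate] OPCODE operands ;` line layout, which may differ between toolkit versions.

```python
import re
from collections import Counter

# One SASS instruction line, assumed to look like:
#   /*0030*/   @P0  IADD3 R5, R5, R6, RZ ;   /* 0x... */
# Encoding-only continuation lines ("/* 0x... */") do not match and are skipped.
_SASS_LINE = re.compile(
    r"^\s*/\*[0-9a-fA-F]+\*/\s+"             # instruction address comment
    r"(?:@\S+\s+)?"                          # optional predicate guard, e.g. @P0 or @!P1
    r"([A-Z][A-Z0-9_]*(?:\.[A-Z0-9_]+)*)"    # opcode including dot modifiers
)


def count_opcodes(sass_text):
    """Return a Counter mapping each opcode (with modifiers) to its occurrence count."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = _SASS_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hand-written illustrative input, NOT real compiler output.
SAMPLE = """\
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;  /* 0x00000a00ff017624 */
                                                                             /* 0x000fe400078e00ff */
        /*0010*/                   IADD3 R0, R0, R2, RZ ;
        /*0020*/                   IMAD.IADD R3, R3, 0x1, R4 ;
        /*0030*/              @P0  IADD3 R5, R5, R6, RZ ;
        /*0040*/                   EXIT ;
"""

if __name__ == "__main__":
    counts = count_opcodes(SAMPLE)
    # Report only the add-like opcodes discussed in this thread.
    for op in ("IADD", "IADD3", "IMAD.IADD", "VIADD"):
        print(f"{op:10s} {counts[op]}")
```

On the sample text this reports two IADD3, one IMAD.IADD, and no IADD or VIADD. For a real binary, you would feed it the text printed by `cuobjdump -sass` instead of `SAMPLE`. Keep in mind that the opcode mix alone does not tell you the achieved throughput; it only shows whether the compiler chose IADD or IADD3.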
Note that this is also the case for some other entries in the instruction throughput table: the table only lists the theoretical maximum throughput, and we cannot always disclose precisely how to achieve it in practice.
If you have further questions, I can try answering them.