Blackwell Integer

mjoux · June 11, 2025, 7:27am

Hi, I’m part of the team responsible for the instruction throughput table in the CUDA programming guide, and I wanted to give you an update on this thread from my point of view:

First, thank you all for raising these issues!
With CUDA 13.0, we will make significant changes to the table: (1) it will move to the CUDA Best Practices Guide, as it is less relevant to the programming model, and (2) it will be restructured and will include example PTX instructions, which will hopefully provide a little more clarity, although it will obviously still be far from perfect.

Concerning Blackwell integer instruction throughput specifically:

As pointed out in the thread already, some instructions have not been improved/changed: this applies to IMAD, LOP3, PRMT for example, as well as IADD3.
The main improved instructions relevant to this thread are IADD, IMNMX/VIMNMX, FSETP/ISETP: addition of 2 operands and min/max/compare.
Concerning integer addition specifically, it is even more complicated: previous architectures could already achieve 2x throughput by combining, for example, IADD3 with IMAD.IADD or VIADD (for compute capability 9.0/10.0). Blackwell 12.0 now allows achieving this 2x throughput with a single instruction: IADD. But note that it can be difficult to get a sequence of instructions that achieves the higher throughput. For previous architectures, this is because of constraints in the instructions as well as in the compiler, which we cannot disclose publicly. For Blackwell 12.0, the compiler often outputs IADD3 instead of IADD; this should be improved soon.

Because of the above, it is unlikely that current benchmarks are able to achieve this 2x throughput, unless they are hand-crafted without relying on the compiler.
Note that this is the case for some other entries in the instruction throughput table: it only lists the theoretical maximum throughput, but we cannot always disclose how to precisely achieve this in practice.
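
For anyone who wants to check what the compiler emits for their own code, here is a minimal sketch of an integer-add microbenchmark. It is only an illustration: the names `run_chains` and `add_chains`, the chain count, and the launch shape are assumptions of mine, not a recommended or official benchmark. As explained above, compiled code like this will likely not reach the 2x figure.

```cuda
// Illustrative sketch only: names, chain count, and launch shape are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

// Four two-operand integer adds per iteration. The first three read only
// values from the previous iteration, so they are independent of each other.
__host__ __device__ inline unsigned int run_chains(unsigned int seed, int iters)
{
    unsigned int a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    for (int i = 0; i < iters; ++i) {
        a += b;
        b += c;
        c += d;
        d += a;
    }
    return a ^ b ^ c ^ d;  // keep every chain live so none is optimized away
}

__global__ void add_chains(unsigned int *out, int iters)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = run_chains(tid, iters);
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(unsigned int) * blocks * threads) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    add_chains<<<blocks, threads>>>(d_out, 1);      // warm-up launch
    cudaEventRecord(start);
    add_chains<<<blocks, threads>>>(d_out, iters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const double adds = 4.0 * iters * blocks * threads;  // total adds in the timed launch
    printf("%.3f ms, %.2f G integer adds/s\n", ms, adds / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Assuming a toolkit with `sm_120` support and a source file named `add_chains.cu` (a hypothetical name), building with `nvcc -arch=sm_120 -o add_chains add_chains.cu` and then running `cuobjdump -sass add_chains` shows whether the loop body was compiled to IADD or IADD3. The compiler is free to unroll or otherwise transform the loop, so the SASS is what counts, not the source.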

If you have further questions, I can try answering them.
