On 3.x devices, how is throughput = 160 for integer instructions (add/compare/xor, etc.)? The SM has 6 sets of 32 Cores. Does one of these 6 sets not have these instructions? Or are only 16 of the 32 Cores shared by each pair of warp schedulers implemented, so that a dual-issue instruction has to be issued over two cycles? Can each warp issue at least one per cycle? Refer to https://www.hardware.fr/medias/photos_news/00/44/IMG0044011_1.jpg
On 5.x devices and above, what about the LOP3 instructions? Are they included with AND/OR/XOR?
The GPU SM has a large collection of functional units. Different functional units service different types of instructions. The things you are calling compute units, more commonly called “cores,” are actually single-precision floating-point units. They service specific instructions such as FADD, FMUL, and FFMA, and little else (i.e. not integer instructions), and operate on 32-bit floating-point operands only. (There are different functional units, in different quantities, for 64-bit floating-point arithmetic.) Integer computations get done in different functional units, and there are apparently 160 of these in the Kepler SM.
(see revision below)
AFAIK the throughput of LOP3 is not formally documented.
(Yes, I’m aware I didn’t answer all your questions.)
As far as throughput is concerned, I believe the LOP3 instruction has the same throughput as AND, OR, and XOR, as this instruction was added as an optimization for code performing lots of logical operations, e.g. cryptocurrency mining. There may be occasional performance-sapping complications with register bank conflicts, as it requires three source operands. Whether LOP3 physically goes through the same functional unit as AND, OR, and XOR I do not know.
Note that regular LOP can apply NOT to one (or even both? I forget) of the source operands, so if your logic operation is a simple combination of one of AND, OR, XOR with NOT of one of the source operands, it would be better to stick with LOP instead of using LOP3.
On 3.x devices, the throughput of 32-bit integer multiply and multiply-add is only 32. Does this mean that integer multiply and integer add are executed on different units? If so, are there 32 units per SM used for 32-bit integer multiply?
I don’t think this is well documented anywhere but I believe my previous comment in this post may have significant inaccuracies, so I would like to take the opportunity to revise it. To preserve the record I will write my revision below.
The Kepler whitepaper offers the statement:

“Each of the Kepler GK110/210 SMX units feature 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units.”

as its only description of integer processing. The instruction-throughput table in the CUDA programming guide reports a throughput of 160 for 32-bit integer add and 32 for 32-bit integer multiply. This doesn’t seem to quite line up with the statement in the whitepaper. However, I think both are probably correct (160 + 32 = 192 may be significant).
Revision to previous statement:
The GPU SM has a large collection of functional units. Different functional units service different types of instructions. The things you are calling compute units, more commonly called “cores,” are typically single-precision floating-point units; however, as noted above, in the case of Kepler they service both integer instructions and floating-point instructions such as FADD, FMUL, and FFMA. (There are different functional units, in different quantities, for 64-bit floating-point arithmetic.)
I apologize for the previous error(s).
I also acknowledge that this does not address the questions raised in the most recent prior post in this thread.
Thank you for the correction.
Is dual issue restricted for some instruction pairs so that the 192 cores cannot be fully utilized for integer add and multiply?
Even so, the throughput of integer multiply could not be as low as 32.
Or are the warp schedulers NOT allowed to issue integer multiply instructions in consecutive cycles?