# Throughput for certain integer arithmetic instructions.

In looking through the arithmetic instruction throughput table https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions, two questions come to mind.

1. On 3.x devices, why is the throughput 160 for integer instructions (add/compare/xor, etc.)? The SM has 6 sets of 32 cores. Does one of these 6 sets not support these instructions? Or are only 16 of the 32 cores shared by each pair of warp schedulers implemented, so that a dual-issued instruction has to be issued over two cycles? Can each warp issue at least one per cycle? Refer to https://www.hardware.fr/medias/photos_news/00/44/IMG0044011_1.jpg
2. On 5.x devices and above, what about the LOP3 instructions? Are they included with AND/OR/XOR?

The GPU SM has a large collection of functional units. Different functional units service different types of instructions. The things you are calling compute units, more commonly called “cores”, are actually single-precision floating-point units. They service specific instructions such as FADD, FMUL, FFMA, and little else (i.e. not integer instructions), and operate on 32-bit floating-point operands only. (There are different functional units, in different quantities, for 64-bit floating-point arithmetic.) Integer computations are done in different functional units, and there are apparently 160 of these in the Kepler SM.

(see revision below)

AFAIK, the throughput of LOP3 is not formally documented.

As far as throughput is concerned, I believe the LOP3 instruction has the same throughput as AND, OR, and XOR, as this instruction was added as an optimization for code performing lots of logical operations, e.g. cryptocurrencies. There may be occasional performance-sapping complications with register bank usage, as it requires three source operands. Whether LOP3 physically goes through the same functional unit as AND, OR, and XOR, I do not know.

Note that regular LOP can apply NOT to one (or even both? I forget) of the source operands, so if your logic operation is a simple combination of one of AND, OR, XOR with NOT of one of the source operands, it would be better to stick with LOP instead of using LOP3.
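To make the LOP vs. LOP3 distinction concrete, here is a minimal host-side sketch of how a three-input lookup-table logic op behaves. It assumes the PTX `lop3.b32` convention, where the 8-bit table is obtained by applying the desired expression to the constants 0xF0, 0xCC, and 0xAA. The names `emulate_lop3`, `kLutAndXor`, and `kLutAndNot` are made up for illustration, and the code says nothing about how the hardware implements the instruction or what its throughput is.

```cpp
#include <cassert>
#include <cstdint>

// Lookup tables built by evaluating the expression on 0xF0 (a), 0xCC (b), 0xAA (c).
// Names are hypothetical, chosen for this example only.
constexpr std::uint8_t kLutAndXor = (0xF0 & 0xCC) ^ 0xAA;  // (a & b) ^ c
constexpr std::uint8_t kLutAndNot = 0xF0 & ~0xCC;          // a & ~b, c ignored

static_assert(kLutAndXor == 0x6A, "table for (a & b) ^ c");
static_assert(kLutAndNot == 0x30, "table for a & ~b");

// Host-side emulation of a three-input LUT op on 32-bit words.
// Table bit i (i = a_bit*4 + b_bit*2 + c_bit) is the output for that input
// combination; the result is the OR of all minterms whose table bit is set.
inline std::uint32_t emulate_lop3(std::uint32_t a, std::uint32_t b,
                                  std::uint32_t c, std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int i = 0; i < 8; ++i) {
        if ((lut >> i) & 1) {
            const std::uint32_t ma = (i & 4) ? a : ~a;
            const std::uint32_t mb = (i & 2) ? b : ~b;
            const std::uint32_t mc = (i & 1) ? c : ~c;
            result |= ma & mb & mc;
        }
    }
    return result;
}
```

The second table, `a & ~b`, is the kind of operation described above as a single logic op combined with NOT on one source operand, so it does not need all three inputs.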

On 3.x devices, the throughput of 32-bit integer multiply and multiply-add is only 32. Does this mean that integer multiply and integer add are executed on different units? If so, are there 32 units per SM used for 32-bit integer multiply?

I don’t think this is well documented anywhere but I believe my previous comment in this post may have significant inaccuracies, so I would like to take the opportunity to revise it. To preserve the record I will write my revision below.

Some data:

The GK210 whitepaper:

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

says:

“Each of the Kepler GK110/210 SMX units feature 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units.”

as its only description of integer processing.

Table 3:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions

reports a throughput of 160 for integer add and 32 for integer multiply. This doesn’t seem to quite line up with the statement in the whitepaper. However, I think both are probably correct (160 + 32 = 192 may be significant).
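As a quick check of the numbers above, the table values can be converted from operations per clock per SM into warp-wide instructions per clock by dividing by the warp size of 32. This is only a sketch of the arithmetic; the constant and function names are invented for this example, and it does not establish how the hardware actually partitions the 192 cores.

```cpp
#include <cassert>

// Values quoted in this thread for compute capability 3.x (per SM, per clock).
// All names here are hypothetical, for illustration only.
constexpr int kWarpSize      = 32;
constexpr int kIntAddPerClk  = 160;  // integer add/compare/logic, from the table
constexpr int kIntMulPerClk  = 32;   // 32-bit integer multiply, from the table
constexpr int kCoresPerSmx   = 192;  // from the GK110/GK210 whitepaper

// One warp instruction corresponds to kWarpSize operations.
constexpr int warp_instructions_per_clock(int ops_per_clock) {
    return ops_per_clock / kWarpSize;
}

// The observation from the post: the two table entries sum to the core count.
static_assert(kIntAddPerClk + kIntMulPerClk == kCoresPerSmx,
              "160 + 32 = 192");
```

Under these numbers, integer add works out to 5 warp instructions per clock, integer multiply to 1, and the full 192 cores to 6.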

Revision to previous statement:

The GPU SM has a large collection of functional units. Different functional units service different types of instructions. The things you are calling compute units, more commonly called “cores”, are typically single-precision floating-point units. However, as noted above, in the case of Kepler they service both integer instructions and floating-point instructions such as FADD, FMUL, and FFMA. (There are different functional units, in different quantities, for 64-bit floating-point arithmetic.)

I apologize for the previous error(s).

I also acknowledge that this does not address the questions raised in the most recent prior post in this thread.

Thank you for the correction.
Is dual issue restricted for some instruction pairs so that the 192 cores cannot be fully utilized for integer add and multiply?
Even so, the throughput of integer multiply cannot be as low as 32.
Or are warp schedulers NOT allowed to schedule integer multiply instructions in consecutive cycles?