Integer instruction performance on Kepler

Hi,

I hope there’s a CUDA engineer nearby who wants to read a wall of text :)
If you don’t want to read the whole story, you can skip to the text in bold!!

I was optimizing a hash cracking kernel and trying to identify its bottleneck;
to do this I’m directly counting the number of instructions per class
(both in the CUDA code and in the compiled code, dumped with the cuobjdump tool)
and evaluating the throughput of the MP for each instruction class.

I discovered some interesting facts about how bit rotation
(which I’ve implemented using a left shift and a right shift) is actually compiled.

Using a c.c. 1.1 target, I found the SHL and SHR instructions,
while using a c.c. 3.0 target, I found a SHL followed by a MAD.HI instruction.
The MAD.HI emulates the SHR (a >> N) by doing a multiply (a * 2^(32-N)),
which actually shifts a left by 32 - N in a temporary 64-bit register; the .HI means
that the most significant 32-bit half of the result is taken and added to the destination register.
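
To make that concrete, here is a small sketch (the helper names are mine, this is not the compiler’s actual output) of why the high half of a multiply can stand in for the right-shift part of a rotate:

// For 0 < n < 32:  x >> (32 - n)  ==  high 32 bits of (x * 2^n),
// so the rotate can be written either with two shifts or with a mul-hi.
__device__ unsigned int rotl_shifts(unsigned int x, unsigned int n)
{
    return (x << n) | (x >> (32 - n));      // SHL + SHR + OR idiom
}

__device__ unsigned int rotl_mulhi(unsigned int x, unsigned int n)
{
    // __umulhi returns the upper 32 bits of the 64-bit product,
    // i.e. x >> (32 - n); adding the SHL result matches a single MAD.HI.
    return (x << n) + __umulhi(x, 1u << n);
}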

I used a script to count the instructions in my compiled functions,
then I tried to calculate the theoretical throughput of the “useful” part of my code
(that is, excluding the overhead).

The problem is:
for c.c. 1.1,
I calculated the theoretical throughput considering that
ADD, logical operations and shifts are executed on the 8 cores of the MP.
(I found 10 op/clock per MP for ADD instructions in the CUDA programming guide;
maybe ADD instructions are issued both to the 8 cores and to the 2 SFUs?)
The real throughput is very close to the theoretical throughput (51 MKey/s real, 52.5 MKey/s theoretical).
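
For reference, the kind of estimate I do looks roughly like this (the clock, MP count and per-key instruction count below are made-up placeholders, not my real numbers; only the 8 op/clock comes from the reasoning above):

#include <stdio.h>

int main(void)
{
    double clock_hz    = 1500e6;  // hypothetical shader clock
    double num_mp      = 16.0;    // hypothetical number of MPs
    double ops_per_clk = 8.0;     // ADD/logic/shift throughput per MP (c.c. 1.1)
    double ops_per_key = 3600.0;  // hypothetical instruction count per key

    // keys/s = total integer ops per second / ops needed per key
    double mkeys = clock_hz * num_mp * ops_per_clk / ops_per_key / 1e6;
    printf("theoretical throughput: %.1f MKey/s\n", mkeys);
    return 0;
}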

BUT

for c.c. 3.0,
calculating it this way gives me a theoretical throughput much smaller than
the actual throughput (impossible!!).
I know that my bottleneck is in the shift/IMAD operations; if I calculate the throughput
considering just these operations, I get a theoretical throughput very close
to my expectations (1323 MKey/s real, 1333 MKey/s theoretical).

I saw in the programming guide:

  • 160 op/clock for IADD/logical
  • 32 op/clock for SHIFT/MAD

So I was wondering:

Is it possible that the Kepler multiprocessor actually has 2 different pipelines for these classes of instructions?
That is, IADDs and AND/OR/XOR are sent to 160 of the 192 cores,
while SHIFT/IMAD are sent to the remaining 32 of the 192 cores?
So that the two classes proceed in parallel, and
in my kernel the bottleneck (shift+mad) completely hides the IADD instructions?
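
In other words, the two models give very different per-key cycle counts. A rough sketch (the instruction counts are made-up placeholders; only the 160 and 32 op/clock figures come from the guide):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n_add_logic = 2500.0;  // hypothetical IADD/logical count per key
    double n_shift_mad = 1100.0;  // hypothetical SHIFT/MAD count per key

    // shared pipeline: every instruction consumes its own issue bandwidth
    double cycles_shared   = n_add_logic / 160.0 + n_shift_mad / 32.0;

    // separate pipelines: the slower class completely hides the faster one
    double cycles_separate = fmax(n_add_logic / 160.0, n_shift_mad / 32.0);

    printf("cycles/key per MP: shared %.1f, separate %.1f\n",
           cycles_shared, cycles_separate);
    return 0;
}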

I am afraid I cannot help with the pipeline structure and throughputs beyond what is documented. As the data changes from GPU to GPU I have to look it up in documentation myself.

I am wondering about the instruction idioms you observed. The secure hashes I have looked at (e.g. MD5, SHA1) use 32-bit rotates with fixed shift amounts. Last I looked, each such rotate resulted in the following machine code (SASS):

sm_1x: SHR, SHL, LOP.OR
sm_2x, sm_30: SHR, ISCADD
sm_35: SHF

where ISCADD is an “integer scale and add” and SHF is a “funnel shift”. As far as I know, the above patterns are optimal for the respective architectures and are limited by the shift throughput. I am surprised a multiply is being generated in your case; maybe that is some trick I am not aware of. Do you observe this with CUDA 5.5?

Like other compilers, the LLVM-based front-end of the CUDA compiler recognizes some common source code idioms for rotates, so there should not be much variability in the generated machine code. The following rotate idioms are recognized from what I could observe (where n is a literal constant between 1 and 31):

(x << n) | (x >> (32 - n))
(x << n) + (x >> (32 - n))
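
For completeness, on sm_35 the rotate can also be written explicitly with the funnel-shift intrinsic that maps to the SHF instruction listed above (a small sketch; it requires CUDA 5.5 and an sm_35 target):

// __funnelshift_l(lo, hi, n) shifts the 64-bit value hi:lo left by n
// and returns the upper 32 bits; passing x for both halves gives a left rotate.
__device__ unsigned int rotl_funnel(unsigned int x, unsigned int n)
{
    return __funnelshift_l(x, x, n);
}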

Interesting, I’m actually working on MD5 and SHA1 implementations,
compiling with CUDA 5.5.
Rotation is implemented this way, in a #define: (x << n) | (x >> (32 - n)).

I looked at the ISCADD usage:
https://code.google.com/p/asfermi/wiki/OpcodeInteger

It seems that the two assembly variants are actually the same,
just in a different order!

Mine was compiled in this order:

  • (x << n) SHL
  • (x >> (32 - n)) IMAD.HI

yours:

  • (x >> (32 - n)) SHR
  • (x << n) ISCADD
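
Written out as C (a rough sketch, the function names are mine), both orderings compute the same rotate:

__device__ unsigned int rotl_shl_then_madhi(unsigned int x, unsigned int n)
{
    unsigned int t = x << n;             // SHL
    return __umulhi(x, 1u << n) + t;     // IMAD.HI adds x >> (32 - n)
}

__device__ unsigned int rotl_shr_then_iscadd(unsigned int x, unsigned int n)
{
    unsigned int t = x >> (32 - n);      // SHR
    return (x << n) + t;                 // ISCADD: (x << n) + t
}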

It looks like ISCADD, IMAD, SHL and SHR have the same throughput
for c.c. 2.0 … 3.0, so they probably give the same performance.

It would be useful to know where they are executed in hardware in order to correctly
calculate the theoretical throughput.

I know that ISCADD and IMAD have the same throughput on sm_20. The compiler should prefer ISCADD in cases where it is functionally equivalent to IMAD. In your case the shift operations seem to be merged differently, such that a later conversion of IMAD is no longer possible. The two variants should have the same performance on sm_20, however, so there is nothing to worry about. I do not know any shift / multiply execution details for sm_30.

While there were some minor code generation issues with SHA1 and MD5 source codes in the past (in particular, NOT operations were not always merged into dependent logic operations), this should be fixed in CUDA 5.5 and these codes should compile into the minimum number of instructions possible.
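
For example (a sketch, the helper name is mine), the NOT in the MD5 “I” step should fold into the dependent OR rather than cost a separate instruction:

__device__ unsigned int md5_i(unsigned int x, unsigned int y, unsigned int z)
{
    return y ^ (x | ~z);   // the ~z should be merged into the LOP.OR
}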

Ok, thanks.
I looked at the NOT operations and they are merged,
but I probably have an early version of CUDA 5.5 installed (5.5.0).

I think that if the Kepler (3.0) MP sends ADD/32/32i instructions to 160 of the 192 cores,
and shifts/mad/iscadd to the remaining 32 cores, that would definitely
explain the actual throughput of the MD5 lookup kernel.
It would also explain why the integer add throughput
(as stated in the CUDA programming guide)
is just 160 and not 192.

In the Fermi architecture it seems that ADD warps map to 32 or 48 cores (2.0 or 2.1),
and shift warps to just a set of 16 cores (so they share the same cores,
but these functionalities are implemented in only a set of 16;
the difference with Kepler is that they don’t follow separate pipelines).

I hope some NVIDIA engineer can confirm that!

Hey njuffa, it seems like you actually are an NVIDIA engineer :D

This info is important because:
for example, in MD5, if the shifts/mad/iscadd become the performance bottleneck (like on 3.0)
and the pipelines are separate, then
even if you reduce the ADD/XOR/AND/OR instructions (even to 0!)
the performance will stay the same!

As a corollary, you could swap some shifts for ADDs, gaining performance
up to a certain threshold…

e.g.
an unsigned a << N could be realized using N ADDs (as long as they don’t yet saturate their own pipeline):

t = a + a // a * 2 -> a << 1
t = t + t // a * 4 -> a << 2
t = t + t // a * 8 -> a << 3
t = t + t // a * 16 -> a << 4
t = t + t // a * 32 -> a << 5
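
A sketch of that idea as a device function (whether it really gains anything depends on how the scheduler issues the two classes; this is unverified):

// Replace a constant left shift by 5 with five dependent adds so the work
// moves from the shift/IMAD pipeline to the IADD cores (hypothetical win).
__device__ unsigned int shl5_by_adds(unsigned int a)
{
    unsigned int t = a + a;   // a << 1
    t = t + t;                // a << 2
    t = t + t;                // a << 3
    t = t + t;                // a << 4
    t = t + t;                // a << 5
    return t;
}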