Hi,
I hope there's a CUDA engineer nearby who wants to read a wall of text :)
If you don't want to read the whole story, you can skip to the text in bold!!
I was optimizing a hash-cracking kernel and trying to identify its bottleneck;
to do this I'm directly counting the number of instructions per class
(both in the CUDA source and in the compiled code, dumped with the cuobjdump tool)
and evaluating the MP throughput per instruction class.
I discovered some interesting facts about how bit rotation
(which I implemented using a left shift and a right shift) is actually compiled.
With a c.c. 1.1 target I found the SHL and SHR instructions,
while with a c.c. 3.0 target I found a SHL followed by a MAD.HI instruction.
The MAD.HI emulates the SHR instruction (a >> N) by doing a multiply (a * 2^(32-N)),
which effectively shifts left by 32 - N into a temporary 64-bit register; the .HI then means
that the most significant 32-bit half of the result is taken and added to the destination register.
I used a script to count the instructions in my compiled functions,
then tried to calculate the theoretical throughput of the "useful" part of my code
(that is, excluding the overhead).
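A counting script like that can be as simple as a grep/sort/uniq pipeline over the cuobjdump disassembly. This is a sketch, not my actual script; the tiny inline sample stands in for a real `cuobjdump -sass` dump, and the opcode list would need adjusting per architecture:

```shell
#!/bin/sh
# Tally instruction mnemonics in a SASS dump (as produced by
# `cuobjdump -sass kernel.cubin`). A hypothetical inline sample
# replaces the real dump here.
cat <<'EOF' > sample.sass
        /*0008*/     SHL R2, R2, 0x5;
        /*0010*/     IMAD.HI R3, R2, R4, R3;
        /*0018*/     IADD R5, R5, R3;
        /*0020*/     IADD R6, R6, R5;
EOF
# Extract the mnemonics of interest and count occurrences per class.
grep -oE '\b(SHL|SHR|IADD|IMAD|LOP|XOR)\b' sample.sass | sort | uniq -c | sort -rn
```

On a real kernel you would point the grep at the full `cuobjdump -sass` output instead of the sample file.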
The problem is:
for c.c. 1.1,
I calculated the theoretical throughput assuming that
ADD, logical operations and shifts are executed on the 8 cores of the MP.
(The CUDA programming guide lists 10 op/clock per MP for ADD instructions;
maybe ADD instructions are sent both to the 8 cores and to the 2 SFUs?)
The real throughput is very close to the theoretical one (51 MKey/s real vs 52.5 MKey/s theoretical).
BUT
for c.c. 3.0,
calculating the same way gives me a theoretical throughput much smaller than
the actual throughput (impossible!!).
I know that my bottleneck is on shift/IMAD operations; if I calculate the throughput
considering just those operations, I get a theoretical value very close
to my measurement (1323 MKey/s real vs 1333 MKey/s theoretical).
In the programming guide I saw:
- 160 op/clock per MP for IADD/logical
- 32 op/clock per MP for SHIFT/MAD
SO I was wondering:
is it possible that the Kepler multiprocessor actually has 2 different pipelines for these classes of instructions?
That is, IADDs and AND/OR/XOR are sent to 160 of the 192 cores,
while SHIFT/IMAD are sent to the remaining 32 of the 192 cores??
So the two classes would execute in parallel, and
in my kernel the bottleneck (shift+mad) would completely hide the IADD instructions???