Why are a uint64_t subtraction and bit shift generating a bottleneck?

Hi.

I use Nsight Compute to profile my code.

Here are the (hopefully) relevant excerpts:

__constant__ char array[] = {
    11, 63, 23,  9, 43, 19, 32, 19,
    23,  7, 13, 22, 43, 52, 22, 53,
    28, 34, 52, 61, 32, 63, 54, 12,
    38, 62, 54, 43, 28, 45, 44,  6,
    23, 42, 37, 46, 54, 23, 45, 27,
    12, 34, 62, 17, 62, 35, 63, 42,
    36, 42, 18, 43, 28, 34, 36, 38,
    43, 52, 67, 56, 43,  1, 24, 57
};

This is the code which generates all kinds of colors in the Nsight Compute Source/Source report:

uint64_t a = 1 - array[i]; // line 1
uint64_t t2 = input >> t1; // line 2

The Sampling Data (All) column shows 1,210,744 samples for line 1.

0.00% Misc(2)
0.14% Dispatch Stall (1730)
0.33% Wait (3971)
0.38% Selected (4577)
1.50% No Instructions (18152)
2.18% Math Pipe Throttle (26367)
3.09% Short Scoreboard (37420)
4.98% Not selected (60355)
87.40% Mio. Throttle (1058170)

Line 2 is not limiting (0 samples).

If I change the code to this (avoiding the subtraction in line 1) …

uint64_t a = array[i]; // line 1
uint64_t t2 = input >> t1; // line 2

The Sampling Data (All) column shows 0 samples for line 1, but line 2 now shows a similar bottleneck for the bit shift as the subtraction did before.

Why is a simple subtraction / bit shift eating up all this performance?
Where can I find an explanation of what all the restricting factors (in the list above) mean?

It seems to me your major bottleneck is this line: 87.40% Mio. Throttle (1058170).

I am not sitting in front of the profiler, but I think “Mio” is memory I/O. In other words, the cores are waiting to insert new memory operations into the I/O queue. This would suggest your code is bound by memory throughput, not computation. Since I cannot tell what is going on in your code overall, I would simply suggest double-checking the memory aspect.

Defining the array as __constant__ and accessing it via an index results in use of the indexed constant cache. Depending on the GPU there is one indexed constant cache per SM or per TPC, giving a maximum access rate of 1 or 0.5 instructions per cycle per SM. The indexed constant cache is accessed via the MIO unit. The low instruction throughput results in warps stalling while waiting to issue the LDC (load constant) instruction. If threads have different indexes, then the instruction is replayed in the MIO ADU (address divergence unit) for each divergent index, increasing the number of warps that stall on MIO Throttle.
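
To make the distinction concrete, here is a minimal sketch (my own illustration, not your kernel; lut, in, out, n and k are assumed names) of a uniform versus a divergent constant-memory access:

#include <cstdint>

// Hypothetical lookup table in constant memory, standing in for your array[].
__constant__ int lut[64];

// Uniform index: all threads in a warp read the same constant slot, so the
// LDC is serviced by the indexed constant cache without replays.
__global__ void uniform_index(const uint64_t *in, uint64_t *out, int n, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] >> (lut[k] & 63);        // k is uniform across the warp
}

// Divergent index: each thread reads a different slot, so the load is replayed
// in the ADU once per distinct index, and warps pile up on the MIO queue.
__global__ void divergent_index(const uint64_t *in, uint64_t *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] >> (lut[tid & 63] & 63); // up to 32 replays per warp
}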

If you would like more information on possible solutions, it would be helpful if you could provide the following information:

  • GPU
  • Calculation of i
  • Number of times each warp will perform the above code

If all threads have the same value of i, you are already accessing this data at maximum performance; L1 and shared memory would give the same performance. If all threads in the warp have different values of i, then shared memory and/or cached L1 would be up to 31x more performant. If each warp only reads the data once, then I would recommend trying cached L1 first. If warps read the data more than once, then you may want to use shared memory.
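
For the shared-memory option, a rough sketch (assuming the 64-entry table from your post; the index and shift calculations are placeholders since the real computation of i and t1 is not shown) could look like this:

#include <cstdint>

__constant__ char array[64];   // filled via a static initializer or cudaMemcpyToSymbol, as in your post

__global__ void kernel(const uint64_t *input, uint64_t *output, int n)
{
    // Stage the constant table into shared memory once per block so that
    // divergent per-thread indexing hits the shared memory banks instead of
    // the indexed constant cache behind the MIO unit.
    __shared__ char s_array[64];
    for (int j = threadIdx.x; j < 64; j += blockDim.x)
        s_array[j] = array[j];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int i = tid & 63;                            // placeholder for your real index calculation
        uint64_t shift = (uint64_t)s_array[i] & 63;  // placeholder for your real t1
        output[tid] = input[tid] >> shift;
    }
}

For the cached-L1 alternative you would instead keep the table in global memory (e.g. a const __device__ array) and read it through the normal load path, which avoids the indexed constant cache without needing the shared-memory staging step.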