Why are a uint64_t subtraction and bit shift generating a bottleneck?

Hi.

I use Nsight Compute to profile my code.

Here are the (hopefully) relevant excerpts:

__constant__ char array[] = {
    11, 63, 23,  9, 43, 19, 32, 19,
    23,  7, 13, 22, 43, 52, 22, 53,
    28, 34, 52, 61, 32, 63, 54, 12,
    38, 62, 54, 43, 28, 45, 44,  6,
    23, 42, 37, 46, 54, 23, 45, 27,
    12, 34, 62, 17, 62, 35, 63, 42,
    36, 42, 18, 43, 28, 34, 36, 38,
    43, 52, 67, 56, 43,  1, 24, 57
};

This is the code which generates all kinds of colors in the Nsight Compute Source/Source report:

uint64_t a = 1 - array[i]; // line 1
uint64_t t2 = input >> t1; // line 2

The Sampling Data (All) column shows 1,210,744 samples for line 1.

0.00% Misc(2)
0.14% Dispatch Stall (1730)
0.33% Wait (3971)
0.38% Selected (4577)
1.50% No Instructions (18152)
2.18% Math Pipe Throttle (26367)
3.09% Short Scoreboard (37420)
4.98% Not selected (60355)
87.40% Mio. Throttle (1058170)

Line 2 is not limiting (0 samples).

If I change the code to this (avoiding the subtraction in line 1) …

uint64_t a = array[i]; // line 1
uint64_t t2 = input >> t1; // line 2

The Sampling Data (All) column shows 0 samples for line 1, but line 2 now shows a similar bottleneck for the bit shift as the subtraction did before.

Why is a simple subtraction / bit shift eating up all this performance?
Where can I find an explanation of what all the restricting factors (in the list above) mean?

It seems to me your major bottleneck is this line: 87.40% Mio. Throttle (1058170).

I am not sitting in front of the profiler, but I think “Mio” is memory I/O. In other words, the cores are waiting to insert new memory operations into the I/O queue. This would suggest your code is bound by memory throughput, not computation. Since I cannot tell what is going on in your code overall, I would simply suggest double-checking the memory aspect.

Defining the array as __constant__ and accessing it via an index results in use of the indexed constant cache. Depending on the GPU there is one indexed constant cache per SM or per TPC, giving a maximum access rate of 1 or 0.5 instructions per cycle per SM. The indexed constant cache is accessed via the MIO unit. The low instruction throughput results in warps stalling while waiting to issue the LDC (load constant) instruction. If threads have different indexes, then the instruction is replayed in the MIO ADU (address divergence unit) for each divergent index, increasing the number of warps that stall on MIO Throttle.
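
To make the distinction concrete, here is a minimal sketch (my own illustration, not your kernel; lut, in, out, n and k are assumed names) of a uniform versus a divergent constant-memory access:

#include <cstdint>

// Hypothetical lookup table in constant memory, standing in for your array[].
__constant__ int lut[64];

// Uniform index: all threads in a warp read the same constant slot, so the
// LDC is serviced by the indexed constant cache without replays.
__global__ void uniform_index(const uint64_t *in, uint64_t *out, int n, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] >> (lut[k] & 63);        // k is uniform across the warp
}

// Divergent index: each thread reads a different slot, so the load is replayed
// in the ADU once per distinct index, and warps pile up on the MIO queue.
__global__ void divergent_index(const uint64_t *in, uint64_t *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] >> (lut[tid & 63] & 63); // up to 32 replays per warp
}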

If you would like more information on possible solutions, it would be helpful if you could provide the following information:

  • GPU
  • Calculation of i
  • Number of times each warp will perform the above code

If all threads have the same value of i, you are already accessing this data at maximum performance; L1 and shared memory would give the same performance. If all threads in the warp have different values of i, then shared memory and/or cached L1 would be up to 31x more performant. If each warp only reads the data once, then I would recommend trying cached L1 first. If warps read the data more than once, then you may want to use shared memory.
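
For the shared-memory option, a rough sketch (assuming the 64-entry table from your post; the index and shift calculations are placeholders since the real computation of i and t1 is not shown) could look like this:

#include <cstdint>

__constant__ char array[64];   // filled via a static initializer or cudaMemcpyToSymbol, as in your post

__global__ void kernel(const uint64_t *input, uint64_t *output, int n)
{
    // Stage the constant table into shared memory once per block so that
    // divergent per-thread indexing hits the shared memory banks instead of
    // the indexed constant cache behind the MIO unit.
    __shared__ char s_array[64];
    for (int j = threadIdx.x; j < 64; j += blockDim.x)
        s_array[j] = array[j];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int i = tid & 63;                            // placeholder for your real index calculation
        uint64_t shift = (uint64_t)s_array[i] & 63;  // placeholder for your real t1
        output[tid] = input[tid] >> shift;
    }
}

For the cached-L1 alternative you would instead keep the table in global memory (e.g. a const __device__ array) and read it through the normal load path, which avoids the indexed constant cache without needing the shared-memory staging step.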