Hi.
I use Nsight Compute to profile my code.
Here are the (hopefully) relevant excerpts:
__constant__ char array[] = {
11, 63, 23,  9, 43, 19, 32, 19,
23,  7, 13, 22, 43, 52, 22, 53,
28, 34, 52, 61, 32, 63, 54, 12,
38, 62, 54, 43, 28, 45, 44,  6,
23, 42, 37, 46, 54, 23, 45, 27,
12, 34, 62, 17, 62, 35, 63, 42,
36, 42, 18, 43, 28, 34, 36, 38,
43, 52, 67, 56, 43,  1, 24, 57
};
This is the code that generates all kinds of colors in the Nsight Compute Source/Source report:
uint64_t a = 1 - array[i]; // line 1
uint64_t t2 = input >> t1; // line 2
The Sampling Data (All) column shows 1,210,744 samples for line 1:
0.00% Misc (2)
0.14% Dispatch Stall (1730)
0.33% Wait (3971)
0.38% Selected (4577)
1.50% No Instructions (18152)
2.18% Math Pipe Throttle (26367)
3.09% Short Scoreboard (37420)
4.98% Not selected (60355)
87.40% MIO Throttle (1058170)
Line 2 is not limiting (0 samples).
If I change the code to this (avoiding the subtraction in line 1) …
uint64_t a = array[i]; // line 1
uint64_t t2 = input >> t1; // line 2
… the Sampling Data (All) column shows 0 samples for line 1, but line 2 now shows a similar bottleneck for the bit-shift operation as the subtraction did before.
Why is a simple subtraction / bit shift eating up all this performance?
Where can I find an explanation of what all the restricting factors (listed above) mean?
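In case the surrounding code matters, here is a reduced, compilable sketch of roughly how these lines are used. Apart from the constant array and the two marked lines, everything in it (kernel name, launch configuration, where i, input, and t1 come from, and how the results are consumed) is a placeholder rather than my exact code:

#include <cstdint>
#include <cuda_runtime.h>

__constant__ char array[] = {
    11, 63, 23,  9, 43, 19, 32, 19,
    23,  7, 13, 22, 43, 52, 22, 53,
    28, 34, 52, 61, 32, 63, 54, 12,
    38, 62, 54, 43, 28, 45, 44,  6,
    23, 42, 37, 46, 54, 23, 45, 27,
    12, 34, 62, 17, 62, 35, 63, 42,
    36, 42, 18, 43, 28, 34, 36, 38,
    43, 52, 67, 56, 43,  1, 24, 57
};

__global__ void kernel(const uint64_t* in, uint64_t* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    uint64_t input = in[idx];          // placeholder input value
    int      i     = idx & 63;         // placeholder index into the constant array
    uint64_t t1    = input & 63;       // placeholder shift amount; the real t1 comes from code not shown

    uint64_t a  = 1 - array[i];        // line 1
    uint64_t t2 = input >> t1;         // line 2

    out[idx] = a + t2;                 // placeholder use of both results
}

int main()
{
    const int n = 1 << 20;
    uint64_t *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(uint64_t));
    cudaMalloc(&out, n * sizeof(uint64_t));
    cudaMemset(in, 0, n * sizeof(uint64_t));   // contents don't matter for profiling

    kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}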