Changing one bit in mask slows down code significantly

I spent quite some time trying to track down this performance issue inside kernel function below.

uint64_t x[32] = { 0 };
uint64_t y = 0;
uint64_t z;
int i;

for (i = 0; i < 32; i++) {
	// x[] is populated
	// z is computed
	y = x[i] & 0xe3e3e3e3e3e3e3ff; // fast
//	y = x[i] & 0xe3e3e3e3e3e3e3fe; // if this line is used instead, it slows down the code by 35%
	// ...
	if (y == z)
		// ...
}

Changing the mask value’s LSB (a single bit) makes the code 35% slower. I ran test after test … it could always be reproduced. What is going on here?

Yes, the complete code would probably tell you exactly where the performance issue comes from, but the code is too lengthy to post and I have no clue what might be relevant.

I’d like to know if I missed something basic in the code above?

The most obvious change is that with a mask of 0xe3e3e3e3e3e3e3fe, y == 123 can never be true, because y will always be an even number. There may be similar effects outside the code shown. The CUDA profiler should be able to help you track down the salient differences between the two variants.

The comparison value was badly chosen. I changed the code above so that if (y == z) can be either true or false (that’s what happens inside the real code). So it’s not a case of the compiler detecting an impossible condition and leaving that part out of the final binary because it can never execute, thus improving performance.

I’m pretty new to CUDA and have to admit that it sometimes surprises me, since I never really cared about performance before. I also never really got into the profilers. This might be a good opportunity. Which profiling tool should I use to see what happens in my code above?

That’s going to eat up the entire next month, I guess.
Ok … I have to start somewhere.
nvprof is deprecated as of CC 7.5.
I think Nsight Compute might be the right start?
Will these graphs tell me why the mask value is capable of reducing my performance?


Digging myself through the profiler pages. Overwhelming amount of information and no real plan/guideline on where to start, so I went for Nsight Compute. I believe the following might be a good start.

Theoretical Occupancy : 37.50 %
Theoretical Active Warps per SM : 12 warps
Achieved Occupancy : 37.31 %
Achieved Active Warps per SM : 11.94 warps
This kernel's theoretical occupancy (37.5%) is limited by the number of required registers.

Does this mean that my kernel requires too many registers?
It uses 168 per thread.

Not necessarily; it’s just stating the fact that a higher occupancy cannot be achieved at this register count. How much occupancy is actually required is very much dependent on the task at hand. If your kernel is suffering badly from a lack of latency hiding, then more occupancy may help.

While not directly addressing your particular issue, you might find the three-part series Robert has written of use:

Also, as you’re new to CUDA, there is a series of nine lectures - sets of slides from each - that may be informative:


Thanks for the links.

My code uses inline assembly, very few host/device data transfers, and little device memory. Also, there’s no data exchange between threads (which would require shared memory).

Looking at Nsight Compute:

  • GPU SOL Throughput is fine (Compute 82% and Memory 5%)
  • ALU is heavily used

Basically all other sections issue warnings for their rules. Many sections point to the source counters due to warp stalls. The highest count comes from BSYNC B6, which (if I’m not mistaken) is not avoidable.

Basically, I don’t really know where to start the optimization using Nsight Compute. I’ve already read quite a few docs and checked some video tutorials, but the learning curve for finding the bottlenecks (if any) is pretty steep.

I believe I’m pretty close to the maximum achievable performance, but I “feel” there are another 10-20% I might still get out of it. But where do I start with Nsight Compute if the warning messages in the sections mutually point to each other?

A very vague comment based on your description: if you’re using 168 registers/thread and warp stalls are limiting performance, then the place to start is trying to get the register count down so you can fit in another block.

168 registers per thread is unusually high - what is driving this? The snippet earlier in the thread showed code operating on 64-bit integers. Since GPUs are 32-bit architectures, using 32-bit integers where possible would be one approach to potentially reduce code complexity and register usage.

Having some anecdotal / vague notions about the code of interest is unlikely to result in specific pertinent recommendations.

The 168 registers were also my first guess. When I count the variables in the device code, I come up with approximately 320 (32-bit) variables, which should live in registers. Checking the device code thoroughly from top to bottom, I notice that some variables are only used at the beginning, theoretically allowing the compiler to reuse those registers (I don’t know whether the compiler does that?). Taking that into account, I come up with +/- 220 registers. I never get the number down to 168.

Is it possible to use Nsight Compute to see which variables are used by the registers?

The code is transposing matrices, lots of bit-twiddling. 95% is 32-bit (which I aim for), but some parts require 64-bit (the shift to 32-bit would be expensive). I know it’s difficult to express recommendations without code, but the idea is to understand how to use Nsight Compute in order to track down performance bottlenecks, like the one mentioned in the OP.

I hope the screenshot above triggers more suggestions.

Well, the screenshot shows you are definitely using 168 reg/thread.
Is there a reason you aren’t using any shared memory? Doing so could take pressure off registers.
Really, without knowing anything about the task, there’s not much to offer. If you add the compiler option --ptxas-options=--verbose, you will be able to see how many registers are being used and whether you are spilling to local memory - something you don’t want to do.

Adding -lineinfo will allow you to view your source beside the generated SASS in Nsight Compute and you will be able to see how variables are being utilised.

Is there a logical place to split this task into 2 or more stages, so you can spread it over more than one kernel?
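Putting the two suggested options together, the compile line might look something like this (the file and output names are placeholders; append the options to whatever nvcc invocation you already use):

```shell
# kernel.cu / kernel are placeholder names
nvcc -lineinfo --ptxas-options=--verbose kernel.cu -o kernel
```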

That’s a standard part of what any compiler (except possibly extremely simple ones) does: live range analysis. There is no fixed assignment of the variables in the program code to hardware registers.

I can’t use shared memory since the “big” uint32_t array data changes for every thread.

1>ptxas info    : 1086 bytes gmem, 1032 bytes cmem[3]

1>ptxas info    : Compiling entry function '_Z16kernel_newyPKjyP4HitsPi' for 'sm_52'
1>ptxas info    : Function properties for _Z16kernel_newyPKjyP4HitsPi
1>    272 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 168 registers, 256 bytes smem, 360 bytes cmem[0], 24 bytes cmem[2]

Thanks for the hints. I’ll try to dig myself through …