How to decrease registers used per thread

I am hitting around 50% occupancy on my GTX 1080 with 64 registers per thread. The bottleneck preventing higher occupancy is the number of registers per thread. I could force the maximum number of registers per thread below 64 with a compilation flag, but that would lead to spilling into local memory.
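For reference, the 50% figure follows from simple arithmetic. The sketch below assumes the per-SM limits usually quoted for compute capability 6.1 (65536 32-bit registers and 2048 resident threads per SM). The function name is made up for illustration, and the estimate ignores block size, shared memory, and register allocation granularity, so treat it as a rough upper bound only.

```cuda
#include <algorithm>
#include <cstdio>

// Rough upper bound on occupancy when registers are the only limiter.
// Hypothetical helper, not part of the CUDA toolkit.
double occupancyFromRegisters(int regsPerThread, int regsPerSM, int maxThreadsPerSM)
{
    const int warpSize = 32;
    const int maxWarps = maxThreadsPerSM / warpSize;
    // Whole warps whose registers fit in the SM register file.
    const int warpsByRegs = regsPerSM / (regsPerThread * warpSize);
    return static_cast<double>(std::min(warpsByRegs, maxWarps)) / maxWarps;
}

int main()
{
    // Assumed limits for compute capability 6.1 (GTX 1080).
    const int regsPerSM = 65536;
    const int maxThreadsPerSM = 2048;

    const int candidates[] = {32, 48, 64};
    for (int regs : candidates) {
        printf("%d registers/thread -> occupancy bound %.3f\n",
               regs, occupancyFromRegisters(regs, regsPerSM, maxThreadsPerSM));
    }
    return 0;
}
```

With 64 registers per thread, only 32 of the 64 possible warps fit, which gives the 0.5 bound; under these assumptions, full occupancy would need 32 registers per thread or fewer.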

From answers in other forums, the general way to decrease the number of registers used per thread is to:

  1. take advantage of shared memory, since shared memory has higher bandwidth than local memory
  2. break up a big kernel into smaller kernels, where each kernel computes a portion of what the big kernel computed and saves its partial output to global memory.

The general consensus seemed to be that it is hard to predict which values will end up in registers because nvcc heavily optimizes the output.

My questions are:

  1. does using fewer statically sized local arrays or local variables decrease the number of registers used per thread?
  2. does decreasing the number of parameters decrease the number of registers used per thread?
  3. are there any other methods to decrease registers per thread (without spilling) and increase occupancy?


1 - The less data a thread has to store for its computation, the fewer registers it will use (“no free lunch”).
2 - Parameters of what? A kernel function? Try profiling a kernel with no arguments that just prints a string and then a kernel that takes 10 arguments and prints all of them, then check the register usage.
3 - You already got the answer to this one in the two suggestions from the other forums. Also keep in mind that shared memory uses a common space with registers. The more shared memory you allocate, the less space will be available for registers. Check sharedMemPerBlock and the shared memory per streaming multiprocessor.
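The experiment suggested in point 2 could be sketched roughly as follows. The kernel names and the `report` helper are made up for illustration; the register and local-memory figures come from `cudaFuncGetAttributes`, and compiling with `nvcc -Xptxas -v` should print comparable numbers at build time. The actual counts depend on the compiler version and target architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with no arguments that just prints a string.
__global__ void noArgs()
{
    printf("hello from thread %d\n", threadIdx.x);
}

// Hypothetical kernel that takes 10 arguments and prints all of them.
__global__ void tenArgs(int a0, int a1, int a2, int a3, int a4,
                        int a5, int a6, int a7, int a8, int a9)
{
    printf("%d %d %d %d %d %d %d %d %d %d\n",
           a0, a1, a2, a3, a4, a5, a6, a7, a8, a9);
}

// Print the per-thread register count and local memory size of a kernel.
template <typename F>
void report(const char* name, F* kernel)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", name, cudaGetErrorString(err));
        return;
    }
    printf("%s: %d registers/thread, %zu bytes local memory/thread\n",
           name, attr.numRegs, attr.localSizeBytes);
}

int main()
{
    report("noArgs", noArgs);
    report("tenArgs", tenArgs);
    return 0;
}
```

A non-zero local memory size here is also a quick way to spot register spilling after capping the register count.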

There’s no way this is true, right?

shared memory does not use a common space with registers

this is not true:

Possibly thinking of L1 cache in non-Pascal GPUs?

If there were code to be looked at, it would probably make for a more fruitful discussion. Some random thoughts:

(1) Does the code use double-precision computation anywhere (possibly accidentally)? Each DP operand requires two of the GPU’s 32-bit registers to store.

(2) The use of thread-local arrays may reduce register usage, but also performance. I have used this to reduce register pressure caused by an infrequently executed code path.

(3) If the code has a lot of single-precision floating-point computation, use of -use_fast_math can often reduce register pressure, but can also have a significant negative impact on accuracy.

(4) Some math functions are fairly expensive in terms of register use, e.g. pow(), and should not be used gratuitously where simpler functions would suffice. Using sinpi(), cospi(), sincospi() instead of sin(), cos(), sincos(), where possible, can often reduce register pressure.
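To illustrate points (1) and (4), here is a minimal sketch with two hypothetical kernels that compute the same quantity. The first one drifts into double precision through unsuffixed literals and `sin()`; the second stays in single precision and uses `sinpif()`. Whether this changes the register count in a given kernel is something to verify with `-Xptxas -v`, not something this sketch guarantees.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Accidental double precision: both literals are doubles, so the product,
// the sin() call, and the final scaling are all evaluated in double.
__global__ void scaledSinDouble(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sin(in[i] * 3.14159265358979) * 0.5;
}

// Single precision throughout: float literal and sinpif(x) = sin(pi * x).
__global__ void scaledSinFloat(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sinpif(in[i]) * 0.5f;
}

int main()
{
    const int n = 256;
    float *in = nullptr, *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) in[i] = i / 64.0f;

    scaledSinDouble<<<1, n>>>(a, in, n);
    scaledSinFloat<<<1, n>>>(b, in, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel execution failed\n");
        return 1;
    }

    // The two variants should agree up to single-precision rounding.
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i) maxDiff = fmaxf(maxDiff, fabsf(a[i] - b[i]));
    printf("max abs difference: %g\n", maxDiff);

    cudaFree(in); cudaFree(a); cudaFree(b);
    return 0;
}
```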

To njuffa, thanks again for your help!

  1. It doesn’t use any doubles, only floats. Does your comment also imply that if I replace my floats with halves, two halves will fit in one register?

  2. I believe arrays whose indices can be figured out at compile time are placed in registers (if space permits), so using thread-local arrays wouldn’t reduce register usage. Let me know if I’m wrong.

  3. I am currently using the flag. Could you please explain what fast-math really does, and why the precision drops?

  4. okay, thanks for the heads up

That’s correct (when full optimizations are turned on in the compiler). Whether that is preventable depends on the circumstances. In my case, the index was derived from a loop counter, and by preventing loop unrolling with a #pragma unroll 1, I was able to keep the array index variable at compile time, so the array was not promoted to registers.
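A minimal sketch of that pattern, assuming a made-up kernel named `localHistory`: the array is indexed only by loop counters, and `#pragma unroll 1` keeps those loops rolled so the indices are not compile-time constants. Whether the compiler then actually places the array in local memory can vary with compiler version and architecture, so the program reports the local memory size for checking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a thread-local array indexed by loop counters.
__global__ void localHistory(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float history[16];  // thread-local array
    float x = in[i];

    // Without unrolling, k stays a runtime value inside the loop body.
    #pragma unroll 1
    for (int k = 0; k < 16; ++k) {
        x = 0.5f * x + 1.0f;
        history[k] = x;
    }

    float sum = 0.0f;
    #pragma unroll 1
    for (int k = 15; k >= 0; --k) {
        sum += history[k] * (k + 1);
    }
    out[i] = sum;
}

int main()
{
    // Query the compiled kernel instead of launching it.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, localHistory);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers/thread: %d, local memory bytes/thread: %zu\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}
```

Removing the two pragmas and rebuilding gives a point of comparison. As noted earlier in the thread, trading registers for local memory this way can cost performance, so it fits best on infrequently executed paths.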

The semantics of -use_fast_math are described in the documentation, and I am quite sure the information provided there will answer your question.

Thanks njuffa!

My apologies to all. I meant L1, not registers.