How to decrease registers used per thread

isaaclee2313 · December 3, 2018, 7:27am

I am hitting around 50% occupancy on my GTX 1080 with 64 registers per thread. The bottleneck preventing higher occupancy is the number of registers per thread. I could just force the maximum number of registers per thread to be lower than 64 through a compilation flag, but that would lead to leakage (spilling) into local memory.

From answers in other forums, the general way to decrease the number of registers used per thread is to:

take advantage of shared memory, since shared memory has higher bandwidth than local memory
break up a big kernel into smaller kernels, where each kernel computes a portion of what the big kernel computed and saves its partial output to global memory.

The general consensus seemed to be that it is hard to predict which values will be in the register because nvcc will heavily optimize the output.

My question is:

does using fewer statically sized local arrays or local variables decrease the number of registers used per thread?
does decreasing the number of parameters decrease the number of registers used per thread?
are there any other methods to decrease registers per thread (without leakage) and increase occupancy?

thanks!

saulocpp · December 3, 2018, 1:23pm

1 - The less stuff the thread has to store for computation, the fewer registers it will use (“no free lunch”).
2 - Parameters of what? A kernel function? Try profiling a kernel with no arguments that just prints a string, and then a kernel that takes 10 arguments and prints all of them, then check the register usage.
3 - You already got the answer to this one in those two suggestions from the other forums. Also keep in mind that shared memory uses a common space with registers. The more shared memory you allocate, the less space will be available for registers. Check sharedMemPerBlock and the shared memory per streaming multiprocessor.

zjw518 · December 3, 2018, 5:37pm

There’s no way this is true, right?

Robert_Crovella · December 3, 2018, 5:52pm

This is not true: shared memory does not use a common space with registers.

zjw518 · December 3, 2018, 5:54pm

Possibly thinking of L1 cache in non-Pascal GPUs?

njuffa · December 3, 2018, 6:31pm

If there were code to be looked at, it would probably make for a more fruitful discussion. Some random thoughts:

(1) Does the code use double-precision computation anywhere (possibly accidentally)? Each DP operand requires two of the GPU’s 32-bit registers to store.

(2) The use of thread-local arrays may reduce register usage, but also performance. I have used this to reduce register pressure caused by an infrequently executed code path.

(3) If the code has a lot of single-precision floating-point computation, use of -use_fast_math can often reduce register pressure, but can also have a significant negative impact on accuracy.

(4) Some math functions are fairly expensive in terms of register use, e.g. pow(), and should not be used gratuitously where simpler functions would suffice. Using sinpi(), cospi(), sincospi() instead of sin(), cos(), sincos(), where possible, can often reduce register pressure.

isaaclee2313 · December 3, 2018, 11:57pm

To njuffa, thanks again for your help!

(1) It doesn’t use any doubles, only floats. Does your comment also imply that if I replace my floats with half, then two halves will be able to fit in one register?
(2) I believe arrays whose indices can be figured out at compile time are placed in registers (if space permits), so using thread-local arrays wouldn’t reduce register usage. Let me know if I’m wrong.
(3) I am currently using the flag. Could you please explain what fast-math really does, and why the precision drops?
(4) Okay, thanks for the heads up.

njuffa · December 4, 2018, 2:01am

That’s correct (when full optimizations are turned on in the compiler). Whether that is preventable depends on the circumstances. In my case, the index was derived from a loop counter, and by preventing loop unrolling with #pragma unroll 1, I was able to keep the array index from becoming a compile-time constant.

The semantics of -use_fast_math are described in the documentation, and I am quite sure the information provided there will answer your question.

isaaclee2313 · December 4, 2018, 2:04am

Thanks njuffa!

saulocpp · December 4, 2018, 8:30am

My apologies to all. I meant L1, not registers.
