Tricks to fight register pressure or how I got down from 29 to 15 registers.

Today I succeeded in getting a kernel down from 29 registers to 15 registers. This now allows me to run 512 threads in a block and perform more work with a smaller grid. I decided to share some of the tricks I used to accomplish this.

First I traded some several thread local variables for use of shared memory where it was possible. For example variables accumulating contributions to a final result are good candidates to be moved to shared memory. Keep in mind that with 512 threads you will require 2048 bytes of storage every float or int variable that you move to shared memory. My shared memory utilization is now close to 16k, but at least I can run with all 512 threads.

I put often used index variables into registers using the “volatile trick”, just like this. It has the effect of keeping the indices around in registers so they are readily available anywhere (without needing to be pulled from shared memory first, and using temporary registers in the process)

__global__ static void Kernel()

{

   volatile unsigned int ue_thread	= threadIdx.x;

   volatile unsigned int sector_index = threadIdx.y;

   volatile unsigned int ue_modulo	= threadIdx.z;

// above three variables are used a LOT in accessing arrays of various types.

// The difference between not having volatile in the above lines and including

// this keyword is 23 vs 15 registers used in my kernel!

}

I make heavy use of dynamic shared memory assignment, where the host passes precomputed offsets to various shared memory arrays into the kernel (and of course the total size of dynamic shared memory as the third kernel launch argument). Pointers to these shared memory arrays can be created quickly and only where needed. Due to my use of the volatile keyword these pointers end up in a register providing efficient access. Absolutely try to avoid declaring the shared memory pointers at kernel level scope (e.g. at the very beginning of the kernel) or the pointer will stick around and use a register all of the time.

__global__ static void Kernel(unsigned int offset1, unsigned int offset2, ...)

{

	extern __shared__ unsigned char shared[];

	// I declare pointers to such arrays in LOCAL scope instead of at kernel level scope, whenever

	// I need them. 

	for (/* some computation loop */)

	{

		if (some condition)

		{

			// now declare this array pointer at local scope

			unsigned char (* volatile s_fading)[128] = (unsigned char (*)[128])&shared[offset1];

			// s_fading can now be accessed as a two dimensional uchar array of size [x][y] with y=128 

			// and x being unspecified (just reserve enough bytes of shared mem for the highest expected x)

			s_fading[sector_index][ue_thread] = ...

		} 

	  // s_fading is now out of scope again, this avoids the compiler to keep it around and

	  // waste a register

	}

}

Generally I try to avoid declaring too many variables at kernel level scope. In many cases it is more efficient to declare (and recompute) index variables several times within local scopes. You just don’t want to do this in the innermost loops (the “hottest” code).

Feel free to append more tips how to reduce your register count.

Thanks, these are very useful tips.

cbuchner1,

Thanks for sharing. I’d be interested to know if you got better performance and by how much (roughly).

Regards.

In my code I incur a significant per-block overhead (in terms of global memory access to read some configuration data which is too big to fit into constant memory). So I was able to speed up my code by about factor 2. A smaller grid means less overhead in my case.

Well this discussion goes back to 7 years ago, so it is likely to be out of interest now. If you still follow this discussion, I am a bit puzzled by the ‘volatile trick’. By declaring volatile, we are basically telling the compiler to locate the variable on global memory. Hence, an access penalty of about ~100x, but yeah, you’re right, now we have more registers available.
So, the question is which one is more beneficial? Allocating variables that are likely to be get called often on global memory and accepting the latency in accessing them OR reducing the number of threads per block so that the all registers can be fit in a block?

the trick only worked on very old, non-LLVM based CUDA compilers. CUDA 1.x and 2.x releases were still using the Open64 compiler infrastructure, and for some reasons declaring variables as volatile worked to some extent to reduce register pressure.

Currently I can not recommend this anymore.

For cuda 10.1 and Turing GPU, the volatile still works, it can reduce register number.

interesting. Likely that’s because it will keep values around in the register for a shorter period of time because data will be fetched from its source location (e.g. global memory) every single time it is used.

So I think you might be trading register use vs. increased memory access here.

Maybe you can share some piece of code (a kernel) where this trick worked for you? I’d love to poke around in it.

Christian

Declaring variables as volatile is likely interfering with various compiler optimizations. For example:

[1] It reduces the “mobility” of load instructions. Loads cannot be scheduled as early as they otherwise could and heir data “cached” in registers.

[2] It reduces the ability to extract common subexpressions, which might otherwise be computed once and “cached” in registers.

Since it is not exactly known how compiler optimizations will be affected, and the effect may differ by GPU architecture and compiler version, I consider this use of ‘volatile’ a hack (or an abuse, depending on my mood :-). The trade-off here is between per-thread efficiency and the total number of threads running.

Since the end result of reducing register usage in this way may be increased performance, this may still be a useful hack. But any code using it should be closely monitored as the code is maintained over time, as the brittle nature of the hack means it could at some point become a pessimization even though it started out as an optimization.