Tricks to fight register pressure or how I got down from 29 to 15 registers.

Today I succeeded in getting a kernel down from 29 registers to 15 registers. This now allows me to run 512 threads in a block and perform more work with a smaller grid. I decided to share some of the tricks I used to accomplish this.

First I traded several thread-local variables for shared memory where possible. For example, variables accumulating contributions to a final result are good candidates to be moved to shared memory. Keep in mind that with 512 threads you will require 2048 bytes of storage for every float or int variable that you move to shared memory. My shared memory utilization is now close to 16k, but at least I can run with all 512 threads.
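As an illustration, here is a minimal sketch of the idea (the kernel name, array size and the reduction itself are just placeholders, not my actual code): a per-thread accumulator that would otherwise occupy a register lives in a shared memory slot instead.

__global__ static void AccumulateKernel(const float* input, float* output, int n)
{
    // one shared memory slot per thread replaces a per-thread register accumulator
    __shared__ float s_accum[512];                 // 512 threads x 4 bytes = 2048 bytes

    s_accum[threadIdx.x] = 0.0f;

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s_accum[threadIdx.x] += input[i];          // accumulate a per-thread partial sum

    output[blockIdx.x * blockDim.x + threadIdx.x] = s_accum[threadIdx.x];
}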

I put often-used index variables into registers using the "volatile trick", like this. It has the effect of keeping the indices around in registers so they are readily available anywhere (without needing to be pulled from shared memory first, and using temporary registers in the process):

__global__ static void Kernel()
{
    volatile unsigned int ue_thread    = threadIdx.x;
    volatile unsigned int sector_index = threadIdx.y;
    volatile unsigned int ue_modulo    = threadIdx.z;

    // The above three variables are used a LOT in accessing arrays of various types.
    // The difference between not having volatile in the above lines and including
    // this keyword is 23 vs 15 registers used in my kernel!
}

I make heavy use of dynamic shared memory allocation, where the host passes precomputed offsets to the various shared memory arrays into the kernel (and of course the total size of dynamic shared memory as the third kernel launch configuration argument). Pointers to these shared memory arrays can be created quickly and only where needed. Due to my use of the volatile keyword these pointers end up in a register, providing efficient access. Absolutely try to avoid declaring the shared memory pointers at kernel-level scope (e.g. at the very beginning of the kernel), or the pointer will stick around and use a register the whole time.

__global__ static void Kernel(unsigned int offset1, unsigned int offset2, ...)
{
    extern __shared__ unsigned char shared[];

    // I declare pointers to such arrays in LOCAL scope instead of at kernel-level scope,
    // whenever I need them.

    for (/* some computation loop */)
    {
        if (/* some condition */)
        {
            // now declare this array pointer at local scope
            unsigned char (* volatile s_fading)[128] = (unsigned char (*)[128])&shared[offset1];

            // s_fading can now be accessed as a two-dimensional uchar array of size [x][y] with y=128
            // and x being unspecified (just reserve enough bytes of shared mem for the highest expected x)
            s_fading[sector_index][ue_thread] = ...
        }

        // s_fading is now out of scope again, which keeps the compiler from keeping it around
        // and wasting a register
    }
}
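For completeness, the host side of such a launch could look roughly like this (MAX_SECTORS, the array sizes and the block/grid dimensions are made-up placeholders; only the offset/size bookkeeping and the third launch configuration argument matter):

// host code (sizes are example values, not from the real application)
unsigned int offset1 = 0;                                         // s_fading starts at byte 0
unsigned int size1   = MAX_SECTORS * 128 * sizeof(unsigned char);
unsigned int offset2 = offset1 + size1;                           // next array follows directly
unsigned int size2   = 512 * sizeof(float);
size_t sharedBytes   = size1 + size2;                             // total dynamic shared memory

Kernel<<<gridSize, dim3(BLOCK_X, BLOCK_Y, BLOCK_Z), sharedBytes>>>(offset1, offset2);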

Generally I try to avoid declaring too many variables at kernel-level scope. In many cases it is more efficient to declare (and recompute) index variables several times within local scopes. You just don't want to do this in the innermost loops (the "hottest" code).
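A contrived sketch of the difference (kernel and variable names are hypothetical):

__global__ static void IndexExampleKernel(float* data)
{
    // no index variables declared here at kernel-level scope

    if (threadIdx.y == 0)
    {
        // computed locally, right where it is needed
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] *= 2.0f;
    }

    // ... other, unrelated work ...

    if (threadIdx.y == 1)
    {
        // recomputed instead of kept alive in a register across the code above
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += 1.0f;
    }
}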

Feel free to append more tips on how to reduce your register count.

Thanks, these are very useful tips.

cbuchner1,

Thanks for sharing. Iā€™d be interested to know if you got better performance and by how much (roughly).

Regards.

In my code I incur a significant per-block overhead (in terms of global memory accesses to read some configuration data which is too big to fit into constant memory). So I was able to speed up my code by roughly a factor of 2. A smaller grid means less overhead in my case.

Well, this discussion goes back 7 years, so it is likely to be out of interest now. If you still follow this discussion, I am a bit puzzled by the 'volatile trick'. By declaring volatile, we are basically telling the compiler to locate the variable in global memory. Hence an access penalty of about ~100x, but yeah, you're right, now we have more registers available.
So the question is: which one is more beneficial? Allocating variables that are likely to be accessed often in global memory and accepting the latency of accessing them, OR reducing the number of threads per block so that all the registers fit in a block?

The trick only worked on very old, non-LLVM-based CUDA compilers. CUDA 1.x and 2.x releases were still using the Open64 compiler infrastructure, and for some reason declaring variables as volatile worked to some extent to reduce register pressure.

Currently I cannot recommend this anymore.

For CUDA 10.1 and a Turing GPU, volatile still works; it can reduce the register count.

Interesting. Likely that's because it will keep values around in a register for a shorter period of time, because data will be fetched from its source location (e.g. global memory) every single time it is used.

So I think you might be trading register use vs. increased memory access here.

Maybe you can share some piece of code (a kernel) where this trick worked for you? I'd love to poke around in it.

Christian

Declaring variables as volatile is likely interfering with various compiler optimizations. For example:

[1] It reduces the "mobility" of load instructions. Loads cannot be scheduled as early as they otherwise could, and their data cannot be "cached" in registers.

[2] It reduces the ability to extract common subexpressions, which might otherwise be computed once and "cached" in registers.

Since it is not exactly known how compiler optimizations will be affected, and the effect may differ by GPU architecture and compiler version, I consider this use of 'volatile' a hack (or an abuse, depending on my mood :-). The trade-off here is between per-thread efficiency and the total number of threads running.

Since the end result of reducing register usage in this way may be increased performance, this may still be a useful hack. But any code using it should be closely monitored as the code is maintained over time, as the brittle nature of the hack means it could at some point become a pessimization even though it started out as an optimization.

@njuffa @cbuchner1 It turns out the volatile trick is still incredibly useful even with the newest GPUs and CUDA versions. In my experiments, I use an RTX 3090 GPU (Ampere) with CUDA 11.1, and the simple volatile trick for temporary index variables makes my kernel 50% faster! It reduces the number of registers from 96 to 80. I have tried almost everything I can, but the only feasible way to alleviate register pressure is the volatile keyword. I really hope a future nvcc compiler can be smarter and not extract common subexpressions so excessively. I am also eager to know if there are better ways to alleviate register pressure.

Some additional notes to the previous comment: Neither applying launch bounds nor limiting registers with the -maxrregcount compiler option helps; even though they can lower the number of registers, the performance improves very little. I don't know why. In this respect, I would say the compiler is rather stupid.
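For reference, these are the two mechanisms referred to above (the kernel body and the limit of 80 registers are just illustrative placeholders):

// limit occupancy-related register use via launch bounds:
// at most 512 threads per block, at least 2 blocks resident per SM
__global__ void __launch_bounds__(512, 2) BoundedKernel(float* data)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

// alternatively, cap registers for the whole translation unit on the command line:
//   nvcc -maxrregcount=80 kernel.cu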

It is not that the compiler is stupid, but that it currently lacks convenient ways to adjust the optimizations it performs at a fairly fine level of granularity, as compared to gcc for example. By default the CUDA compiler compiles code at -O3, with all optimizations turned to 10. Just like with host compilers, this can at times be counter-productive, leading to "excessive" unrolling, or inlining, or something else. For example, my impression from multiple reports of performance regressions with the latest compilers (11.6) is that loop unrolling is a bit over the top compared to older versions (say, 11.1).

This whole scenario is made more complicated by the fact that the CUDA compiler is really two optimizing compilers bolted together, using PTX code as the interface. This can lead to impedance mismatches. That is a challenge that most host compilers do not have, caused by shifting hardware complexity (maintaining a stable ISA) to the software.

In practical terms, you could file an enhancement request with NVIDIA to provide more optimization control knobs for the compiler, and you could try reducing the ptxas optimization level with -Xptxas -O{1|2|3} (-O0 is really only useful for debug builds) to see whether that makes a difference to performance. Since ptxas is responsible for register allocation and machine instruction scheduling, it might.

I get it. I think a possible way might be to add control knobs that tell the compiler to separate the kernel into blocks and not extract intermediate variables that appear in multiple blocks. For example, a variety of kernels have the following pattern:

1. read data from global memory with the address set S, e.g. read A[i] for i in S
2. process the data
3. write the outcome to global memory with the same address set S, e.g. write B[i] = f(A[i]) for i in S

Here S is the index set, which may need complicated calculations and thus a lot of registers to store the intermediate indices. Since these registers can be reused in the final writing stage, the nvcc compiler typically keeps them live, which causes register pressure. In this case, a better way is to re-calculate the index set S when writing. So I think we could tell the compiler to separate the kernel into blocks (or stages) so that these intermediate values between different stages are re-calculated.
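To make the pattern concrete, here is a toy sketch (compute_index() is purely hypothetical). The source is written so that the index is recomputed in the write stage, but the compiler typically extracts the common subexpression and keeps it live across the processing stage, which is the behavior I would like a knob to disable:

__device__ __forceinline__ unsigned int compute_index(unsigned int tid)
{
    // stand-in for a complicated index calculation needing several temporaries
    return (tid * 7u + 3u) % 1024u;
}

__global__ void ReadProcessWrite(const float* A, float* B)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // read stage: A[i] for i in S
    float a = A[compute_index(tid)];

    // processing stage: many other values are live here
    float b = a * a + 1.0f;

    // write stage: B[i] = f(A[i]) for i in S; recomputing the index here would
    // free its registers during the processing stage, but the compiler usually
    // prefers to reuse the value computed for the read
    B[compute_index(tid)] = b;
}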

TBH, that sounds quite speculative. The way forward on such issues is usually to first analyze the generated code in detail to find the exact root cause(s) of sub-optimal performance, then try to identify the optimization phase(s) that contribute to that. From working with the CUDA compiler engineers back in the day, I seem to recall there are several dozen phases in total, and of course there are cross-dependencies that can lead to phase-ordering issues.

I am vaguely aware that the LLVM technology which forms the basis of the upper stages of the CUDA compilation pipeline has a number of internal knobs to control them. I am not up-to-date whether any of these are accessible for the normal CUDA user, or what it would take to do so, if not yet available.

You should also be prepared for NVIDIA's engineers to come back with an assessment that they do not wish to provide external knobs to control individual optimizations as gcc often does. That will come down to design philosophy. When I was in CPU design we called these "chicken bits" (control bits to disable various processor mechanisms), and over the years I came to the conclusion that one should not have them.

In that spirit, for starters, you might want to file just an ordinary performance bug against the compiler for NVIDIA's compiler engineers to look at. I would expect that with the kind of performance difference you are observing it would be an interesting use case for them to look at.

Thanks for your reply. Indeed, the above is just my speculation, but I believe it is true because when I set "volatile" for these index variables, the performance improves dramatically. These index variables only appear at the beginning and at the end. I will try to create a minimal reproducible example and file a bug report later.
