Register usage of a device function for vector rotation

Why does the following vector rotation code use a lot of registers (40-50 shown by nvcc using the -Xptxas -v flag)?

Background:

  • This is a device function defined here and called in the kernel here.
  • Compilation of the function itself says 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  • Compilation of the kernel function with and without the call to it shows a 40-50 register difference.
  • When called, the pointers norm, tau1, and tau2 point to global memory (does it matter in terms of register usage, since this will be inlined?).
  • Any tips to reduce register usage, by optimizing the function itself, or on the kernel side (say, by caching the arguments into local memory first before being used)?
__host__ __device__
static inline void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  qglobal[0] = qlocal[0]*norm[0] + qlocal[1]*tau1[0] + qlocal[2]*tau2[0];
  qglobal[1] = qlocal[0]*norm[1] + qlocal[1]*tau1[1] + qlocal[2]*tau2[1];
  qglobal[2] = qlocal[0]*norm[2] + qlocal[1]*tau1[2] + qlocal[2]*tau2[2];

  qglobal[3] = qlocal[3]*norm[0] + qlocal[4]*tau1[0] + qlocal[5]*tau2[0];
  qglobal[4] = qlocal[3]*norm[1] + qlocal[4]*tau1[1] + qlocal[5]*tau2[1];
  qglobal[5] = qlocal[3]*norm[2] + qlocal[4]*tau1[2] + qlocal[5]*tau2[2];

  qglobal[6] = qlocal[6];
  qglobal[7] = qlocal[7];
}

Reducing register usage won’t necessarily increase performance. The compiler makes its decisions in an attempt to maximize performance.

In order to avoid reloads of data, the compiler would need at least 18 registers.

The biggest single factor accounting for the difference between 18 and 40 registers is the compiler doing early loads of data that it will use later. This is a typical compiler optimization: because the loads happen earlier than strictly necessary, more registers are needed to hold the data.

I guess the first thing I would do is use launch bounds to gradually dial down the register usage limit and see where the spill loads start. If the compiler is smart, it should be able to reorder its loads and “give you back” at least 12 registers by switching from early loads to late loads, without introducing register spills.
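A hypothetical sketch of the launch-bounds approach (the bounds shown are illustrative, not recommendations):

```cuda
// __launch_bounds__ tells ptxas the maximum block size (and, optionally,
// the minimum number of resident blocks per SM) the kernel must support,
// which indirectly caps the registers it may use per thread.
__global__ void __launch_bounds__(256, 2)
k(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}
```

Lowering the bounds step by step while watching the -Xptxas -v output shows where the spills begin.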

Here is an example that I used for study:

$ cat t2036.cu
__host__ __device__
static inline void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  double temp1 = norm[0];
  double temp2 = tau1[0];
  double temp3 = tau2[0];
  double temp4 = qlocal[0];
  double temp5 = qlocal[1];
  double temp6 = qlocal[2];
  double temp7 = qlocal[3];
  double temp8 = qlocal[4];
  double temp9 = qlocal[5];
  qglobal[0] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[3] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  temp1 = norm[1];
  temp2 = tau1[1];
  temp3 = tau2[1];
  qglobal[1] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[4] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  temp1 = norm[2];
  temp2 = tau1[2];
  temp3 = tau2[2];
  qglobal[2] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[5] = temp7*temp1 + temp8*temp2 + temp9*temp3;

  qglobal[6] = qlocal[6];
  qglobal[7] = qlocal[7];
}


__global__ void k(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}

$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info    : Function properties for _Z1kPKdS0_S0_S0_Pd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 392 bytes cmem[0]
$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70 -maxrregcount 32
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info    : Function properties for _Z1kPKdS0_S0_S0_Pd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 392 bytes cmem[0]
$

CUDA 11.4

According to my test case, the compiler is willing to give me back 8 registers, without instituting spills. The above refactoring of your code has no initial effect on register usage. I did it to clarify what I thought were the minimum registers needed to avoid unnecessary reloading of data.

To be very clear, I’m not suggesting this is a good idea. It may make your code run slower.


Thank you, Robert. I will try this out!

In my case, the register usage is nearly at the limit (~245), and larger runs throw “too many resources are requested to launch” (anything larger than nthreads * nblocks = 512 * 256 on an A100). This is why I’m a little desperate and looking for ways to reduce register usage.
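For what it’s worth, here is a back-of-the-envelope check of that launch failure (the 64K figure below is an assumption based on the A100’s per-block register limit):

```python
# Hypothetical arithmetic: a block's total register demand must fit
# within the per-block register limit (64K on an A100, CC 8.0).
REGS_PER_BLOCK_LIMIT = 65536

def launch_fits(threads_per_block, regs_per_thread):
    """True if one block's register demand stays within the limit."""
    return threads_per_block * regs_per_thread <= REGS_PER_BLOCK_LIMIT
```

With ~245 registers per thread, a 512-thread block demands 125,440 registers and cannot launch, while a 256-thread block fits.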

In my test, the new kernel itself does not reduce the register count.

But I did notice that one reason for the high register usage was that the device function was passed as a function pointer in a struct given to the kernel. If I call the device function directly, the register usage goes down quite a lot (>25 fewer registers).

My guess is either

  1. there is overhead in calling a function through a pointer stored in a struct, AND/OR
  2. by wrapping the function pointer in a struct, the compiler does not know at compile time which device function will be called, and therefore cannot optimize the kernel and the device function together

Sounds like an XY problem (we have been presented with Y, but the issue originates from the larger context X, which has not been revealed here).

(1) Is the code of this device function in the same compilation unit as the code of the kernel that calls it?
(2) Does the build use separate compilation for the device code?
(3) If the answer to (2) is “yes”, is link-time optimization enabled?
(4) What happens if the inline attribute for rot_to_global is changed to __noinline__?

I said that already right here:

A function called via a pointer cannot be inlined in the general case, so I would certainly expect that observation.


Hi njuffa,

  • The device function and kernel (and executable) are built in the same compilation.
  • I did not turn on link-time optimization (the -dlto option).
    • After I turned -dlto on, performance improved by 4%, but the actual register usage went up (which limits the number of threads/blocks I can use). Still, this is useful, and I will try to use it more often. Thank you.
  • My attempt to use __noinline__ failed. After changing __host__ __device__ static inline void to __host__ __device__ static __noinline__ void, I got the error
    expected ‘;’ before ‘void’
     62 | __host__ __device__ static __noinline__ void
       |                                        ^~~~~
       |                                        ;
    
    I have an old MX130 card and
    $ nvcc --version
    Built on Mon_Nov_30_19:08:53_PST_2020
    Cuda compilation tools, release 11.2, V11.2.67
    Build cuda_11.2.r11.2/compiler.29373293_0
    
  • On the other hand, as I mentioned earlier, I think my problem was that I was passing a function pointer instead of directly calling a known function. After I changed to directly call the device function, the register usage went down quite a bit.
  • If function pointers are the real cause, I am thinking about generating a different kernel for each combination of the sub-functions. But that will make the code structure more complicated (or using macros or templates). Do you have any generic suggestions (on managing many different kernel versions)?
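As a sketch of the template idea (all names here are illustrative, and the CUDA qualifiers are stubbed out so the snippet also compiles as plain host C++): passing the sub-function as a functor type lets the compiler see the call target at compile time and inline it.

```cpp
// Stub the CUDA qualifiers so this sketch also builds as host C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#define __global__
#endif

// Two illustrative sub-functions, written as functors.
struct RotIdentity {
    __host__ __device__ void operator()(const double *in, double *out) const {
        for (int i = 0; i < 8; ++i) out[i] = in[i];
    }
};

struct RotNegate {
    __host__ __device__ void operator()(const double *in, double *out) const {
        for (int i = 0; i < 8; ++i) out[i] = -in[i];
    }
};

// One kernel template covers every combination of sub-functions; each
// instantiation is a separate kernel with the call fully inlined,
// unlike a runtime function pointer, which blocks inlining.
template <typename RotOp>
__global__ void apply_rot(const double *in, double *out, RotOp op)
{
    op(in, out);
}
```

Each combination then becomes one explicit instantiation, e.g. apply_rot<RotNegate><<<blocks, threads>>>(...), which keeps the dispatch decision on the host side.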

Hi Robert. Sorry, I missed the first statement you said.

If function pointers are the real cause, I am thinking about generating a different kernel for each combination of the sub-functions. But that will make the code structure more complicated (or using macros or templates). Do you have any generic suggestions (on managing many kernel versions)?

Not sure what is going on there. __noinline__ definitely is a valid attribute and works in device code. Maybe it is not applicable to __host__ __device__ code? Try changing the order, putting the __noinline__ attribute first.
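A hypothetical reordering (placement rules for function attributes can vary across toolkit versions):

```cuda
// Attribute first, before the CUDA execution-space qualifiers:
__noinline__ __host__ __device__ static void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
              const double *__restrict__ qlocal, double *__restrict__ qglobal);
```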

The general picture here is that many of the important optimization strategies used by the compiler tend to increase register usage: function inlining (made easier when the function is in the same compilation unit, when inline attributes are used, or when link-time optimization is enabled), loop unrolling (either automatic or via #pragma unroll), and early load scheduling (made easier by the use of __restrict__ pointers). Since modern GPU architectures provide plentiful registers, this is typically the best strategy for increasing performance, and the compiler usually makes the right trade-offs with regard to performance.

From what has been presented in this thread, it is not clear what is really driving register use on the order of 250 registers in the kernel. It is not really rot_to_global based on what has been discussed here. You would want to identify all major contributors to register pressure before brainstorming about potential source-level changes.

If the register use is driven up by automatic loop unrolling, try adding #pragma unroll 1 before the innermost loops. This will likely have a noticeable negative effect on performance.
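A hypothetical sketch of what that looks like in practice:

```cuda
// #pragma unroll 1 keeps the innermost loop rolled, trading some
// performance for lower register pressure.
__global__ void scale(double *data, double s, int n)
{
    #pragma unroll 1
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}
```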


Thank you, njuffa.

I will try to identify the true bottleneck.

After previous experiments, I feel my compiler is doing a lot of optimization, which is excellent for production but makes debugging less straightforward.

For example, optimization also seems to interfere with the --maxrregcount option. Before I set it, 250 registers are used. After I specify it, even with --maxrregcount 255 the code still spills. --maxrregcount seems to downgrade the optimization level.

Next, I plan to turn off optimization as much as possible to find the bottleneck. Does this make sense?

It’s the other way around. The compiler uses a collection of carefully tuned heuristics to decide how to trade off register usage for performance. -maxrregcount throws a wrench (or a spanner, if you are British) into the works by imposing a hard limit on register use. Hilarity ensues.

-maxrregcount was invented as a “quick fix” to address register pressure issues in the early days of CUDA, when GPU architectures were starved for registers. I last used it maybe ten years ago, and I would strongly recommend against using it. __launch_bounds__ was later introduced as a somewhat more sophisticated and finer-grained mechanism to impose register use limits. At this point, I would not recommend using that either.

Programmer attempts at pushing down register usage usually result in lower performance, in my observation. There may be isolated cases where it leads to success (better performance), but generally it just mucks up the compiler heuristics. FWIW, I am not clear why pushing down register usage is considered imperative here. Why not use the full 255 registers? At the hardware level, register allocation typically has a granularity greater than 1 anyhow, so pushing down the register usage reported by the compiler may not make any difference once that number is rounded up to the next multiple of N registers by the hardware (you can play with this in the CUDA occupancy calculator).

Higher occupancy is not strongly correlated with higher performance.

My motivation is to be able to use more threads. Originally, I could only run up to “nthreads * nblocks = 512 * 256”.

After some experimenting, I can confirm that brutally setting maxrregcount lower to allow more threads does not produce an appreciable gain in overall performance.

I still hope to reduce the register usage, though. So far, the only success has come from calling the device function directly instead of through function pointers.

By using maxrregcount, what I really hoped to do was debug the real cause of the high register usage (aside from optimization). But maybe setting -Xptxas -O1 is the better choice?

I really appreciate your explanation on the maxrregcount and launch bounds, and their history!

Generally speaking, a good rule of thumb is to use between 128 and 256 threads per thread block (in increments of 32 threads), unless there is a compelling reason to choose other block dimensions.

GPU architectures supported by current CUDA versions provide either 32K or 64K registers per thread block, so choosing 128 threads per thread block seems to be the way to go here, in that it allows the full set of 255 registers per thread across all supported architectures (>= CC 3.5). You can play with the occupancy calculator, but I don’t see how using 245 versus 255 registers per thread is going to make a difference. Experiments with 64 threads per thread block may be of interest on architectures that provide only 32K registers per thread block, as it is typically a good idea to have more than one thread block resident per SM.
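The arithmetic above can be sketched as follows (the register file size and the 255-register cap are assumptions for recent architectures, e.g. an A100):

```python
# Hypothetical occupancy arithmetic: registers available per thread
# when a given block size has one block resident per 64K-register SM.
REGS_PER_SM = 65536
MAX_REGS_PER_THREAD = 255  # architectural per-thread cap

def regs_available_per_thread(threads_per_block):
    """Per-thread register budget, capped at the architectural limit."""
    return min(REGS_PER_SM // threads_per_block, MAX_REGS_PER_THREAD)
```

At 128 threads per block the full 255-register cap is available; doubling the block size to 512 threads would require getting the kernel down to 128 registers per thread.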

Simply make the grid large enough to cover all data the kernel needs to process.

I was hoping to get it down to something like 128 registers (so that I can use double the threads). But maybe the kernel is too complicated in nature to achieve that…

All your explanations really helped me to understand the logic behind this. Thank you very much.

If the kernel currently compiles to code using 245 registers, getting that down to 128 registers is pretty much Mission Impossible. Even if you managed to do that, the steps taken to achieve that goal would likely drastically reduce the performance of the generated code.

Since each GPU register comprises 32 bits, every double variable requires two registers, so it is not unusual to see high register usage with double-precision computations. I would suggest spending some quality time with the CUDA profiler to let it pinpoint the bottlenecks in the code in its present form.
