Register usage of a device function for vector rotation

Why does the following vector rotation code use a lot of registers (40-50 shown by nvcc using the -Xptxas -v flag)?

Background:

  • This is a device function defined here and called in the kernel here.
  • Compilation of the function itself says 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  • Compilation of the kernel function with and without the call to it shows a 40-50 register difference.
  • When called, the pointers norm, tau1, and tau2 point to global memory (does it matter in terms of register usage, since this will be inlined?).
  • Any tips to reduce register usage, by optimizing the function itself, or on the kernel side (say, by caching the arguments into local memory first before being used)?
__host__ __device__
static inline void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  qglobal[0] = qlocal[0]*norm[0] + qlocal[1]*tau1[0] + qlocal[2]*tau2[0];
  qglobal[1] = qlocal[0]*norm[1] + qlocal[1]*tau1[1] + qlocal[2]*tau2[1];
  qglobal[2] = qlocal[0]*norm[2] + qlocal[1]*tau1[2] + qlocal[2]*tau2[2];

  qglobal[3] = qlocal[3]*norm[0] + qlocal[4]*tau1[0] + qlocal[5]*tau2[0];
  qglobal[4] = qlocal[3]*norm[1] + qlocal[4]*tau1[1] + qlocal[5]*tau2[1];
  qglobal[5] = qlocal[3]*norm[2] + qlocal[4]*tau1[2] + qlocal[5]*tau2[2];

  qglobal[6] = qlocal[6];
  qglobal[7] = qlocal[7];
}

Reducing register usage won’t necessarily increase performance. The compiler makes its decisions in an attempt to maximize performance.

In order to avoid reloads of data, the compiler would need at least 18 registers.

The biggest single factor accounting for the difference between 18 and 40 registers is the compiler doing early loads of data that it will use later. This is a typical compiler optimization: because the loads happen earlier than strictly necessary, more registers are needed to hold the data.

I guess the first thing I would do is use launch bounds to gradually dial down the register usage limit and see where the spill loads start. If the compiler is smart, it should be able to reorder its loads and “give you back” at least 12 registers by switching from early loads to late loads, without introducing register spills.
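A hypothetical sketch of the launch-bounds approach (the bounds shown are illustrative, not recommendations):

```cuda
// __launch_bounds__ tells ptxas the maximum block size (and, optionally,
// the minimum number of resident blocks per SM) the kernel must support,
// which indirectly caps the registers it may use per thread.
__global__ void __launch_bounds__(256, 2)
k(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}
```

Lowering the bounds step by step while watching the -Xptxas -v output shows where the spills begin.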

Here is an example that I used for study:

$ cat t2036.cu
__host__ __device__
static inline void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  double temp1 = norm[0];
  double temp2 = tau1[0];
  double temp3 = tau2[0];
  double temp4 = qlocal[0];
  double temp5 = qlocal[1];
  double temp6 = qlocal[2];
  double temp7 = qlocal[3];
  double temp8 = qlocal[4];
  double temp9 = qlocal[5];
  qglobal[0] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[3] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  temp1 = norm[1];
  temp2 = tau1[1];
  temp3 = tau2[1];
  qglobal[1] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[4] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  temp1 = norm[2];
  temp2 = tau1[2];
  temp3 = tau2[2];
  qglobal[2] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[5] = temp7*temp1 + temp8*temp2 + temp9*temp3;

  qglobal[6] = qlocal[6];
  qglobal[7] = qlocal[7];
}


__global__ void k(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}

$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info    : Function properties for _Z1kPKdS0_S0_S0_Pd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 392 bytes cmem[0]
$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70 -maxrregcount 32
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info    : Function properties for _Z1kPKdS0_S0_S0_Pd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 392 bytes cmem[0]
$

CUDA 11.4

According to my test case, the compiler is willing to give me back 8 registers, without instituting spills. The above refactoring of your code has no initial effect on register usage. I did it to clarify what I thought were the minimum registers needed to avoid unnecessary reloading of data.

To be very clear, I’m not suggesting this is a good idea. It may make your code run slower.


Thank you, Robert. I will try this out!

In my case, the register usage is nearly at the limit (~245), and larger runs throw “too many resources are requested to launch” (anything larger than nthreads * nblocks = 512 * 256 on an A100). This is why I’m a little desperate and looking for ways to reduce register usage.
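For what it’s worth, here is a back-of-the-envelope check of that launch failure (the 64K figure below is an assumption based on the A100’s per-block register limit):

```python
# Hypothetical arithmetic: a block's total register demand must fit
# within the per-block register limit (64K on an A100, CC 8.0).
REGS_PER_BLOCK_LIMIT = 65536

def launch_fits(threads_per_block, regs_per_thread):
    """True if one block's register demand stays within the limit."""
    return threads_per_block * regs_per_thread <= REGS_PER_BLOCK_LIMIT
```

With ~245 registers per thread, a 512-thread block demands 125,440 registers and cannot launch, while a 256-thread block fits.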

In my test, the new kernel itself does not reduce the register count.

But I did notice that one reason for the high register usage was that the device function was passed as a function pointer in a struct given to the kernel. If I call the device function directly, the register usage goes down quite a lot (>25 fewer registers).

My guess is either

  1. there is overhead in calling a function through a pointer stored in a struct, AND/OR
  2. by wrapping the function pointer in a struct, the compiler does not know at compile time which device function will be called, and therefore cannot optimize the kernel and the device function together

Sounds like an XY problem (we have been presented with Y, but the issue originates from the larger context X, which has not been revealed here).

(1) Is the code of this device function in the same compilation unit as the code of the kernel that calls it?
(2) Does the build use separate compilation for the device code?
(3) If the answer to (2) is “yes”, is link-time optimization enabled?
(4) What happens if the inline attribute for rot_to_global is changed to __noinline__?

I said that already right here:

A function called via a pointer cannot be inlined in the general case, so I would certainly expect that observation.


Hi njuffa,

  • The device function and kernel (and executable) are built in the same compilation.
  • I did not turn on link-time optimization (the -dlto option).
    • After I turned -dlto on, performance improved by 4%, but the actual register usage went up (which limits the number of threads/blocks I can use). Still, this is useful, and I will try to use it more often. Thank you.
  • My attempt to use __noinline__ failed. After changing __host__ __device__ static inline void to __host__ __device__ static __noinline__ void, I got the error
    expected ‘;’ before ‘void’
     62 | __host__ __device__ static __noinline__ void
       |                                        ^~~~~
       |                                        ;
    
    I have an old MX130 card and
    $ nvcc --version
    Built on Mon_Nov_30_19:08:53_PST_2020
    Cuda compilation tools, release 11.2, V11.2.67
    Build cuda_11.2.r11.2/compiler.29373293_0
    
  • On the other hand, as I mentioned earlier, I think my problem was that I was passing a function pointer instead of directly calling a known function. After I changed to directly call the device function, the register usage went down quite a bit.
  • If function pointers are the real cause, I am thinking about generating a different kernel for each combination of the sub-functions. But that will make the code structure more complicated (or using macros or templates). Do you have any generic suggestions (on managing many different kernel versions)?
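As a sketch of the template idea (all names here are illustrative, and the CUDA qualifiers are stubbed out so the snippet also compiles as plain host C++): passing the sub-function as a functor type lets the compiler see the call target at compile time and inline it.

```cpp
// Stub the CUDA qualifiers so this sketch also builds as host C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#define __global__
#endif

// Two illustrative sub-functions, written as functors.
struct RotIdentity {
    __host__ __device__ void operator()(const double *in, double *out) const {
        for (int i = 0; i < 8; ++i) out[i] = in[i];
    }
};

struct RotNegate {
    __host__ __device__ void operator()(const double *in, double *out) const {
        for (int i = 0; i < 8; ++i) out[i] = -in[i];
    }
};

// One kernel template covers every combination of sub-functions; each
// instantiation is a separate kernel with the call fully inlined,
// unlike a runtime function pointer, which blocks inlining.
template <typename RotOp>
__global__ void apply_rot(const double *in, double *out, RotOp op)
{
    op(in, out);
}
```

Each combination then becomes one explicit instantiation, e.g. apply_rot<RotNegate><<<blocks, threads>>>(...), which keeps the dispatch decision on the host side.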

Hi Robert. Sorry, I missed the first statement you said.

If function pointers are the real cause, I am thinking about generating a different kernel for each combination of the sub-functions. But that will make the code structure more complicated (or using macros or templates). Do you have any generic suggestions (on managing many kernel versions)?

Not sure what is going on there. __noinline__ definitely is a valid attribute and works in device code. Maybe it is not applicable to __host__ __device__ code? Try changing the order, putting the __noinline__ attribute first.
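A hypothetical reordering (placement rules for function attributes can vary across toolkit versions):

```cuda
// Attribute first, before the CUDA execution-space qualifiers:
__noinline__ __host__ __device__ static void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
              const double *__restrict__ qlocal, double *__restrict__ qglobal);
```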

The general picture here is that many of the important optimization strategies used by the compiler tend to increase register usage: function inlining (made easier when the function is in the same compilation unit, when inline attributes are used, or when link-time optimization is enabled), loop unrolling (either automatic or via #pragma unroll), and early load scheduling (made easier by the use of __restrict__ pointers). Since modern GPU architectures provide plentiful registers, this is typically the best strategy for increasing performance, and the compiler usually makes the right trade-offs with regard to performance.

From what has been presented in this thread, it is not clear what is really driving register use on the order of 250 registers in the kernel. It is not really rot_to_global based on what has been discussed here. You would want to identify all major contributors to register pressure before brainstorming about potential source-level changes.

If the register use is driven up by automatic loop unrolling, try adding #pragma unroll 1 before the innermost loops. This will likely have a noticeable negative effect on performance.
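A hypothetical sketch of what that looks like in practice:

```cuda
// #pragma unroll 1 keeps the innermost loop rolled, trading some
// performance for lower register pressure.
__global__ void scale(double *data, double s, int n)
{
    #pragma unroll 1
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}
```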


Thank you, njuffa.

I will try to identify the true bottleneck.

After previous experiments, I feel my compiler is doing a lot of optimization, which is excellent for production but makes debugging less straightforward.

For example, optimization also seems to interfere with the --maxrregcount option. Before I set it, 250 registers are used. After I specify it, even with --maxrregcount 255 the code still spills. --maxrregcount seems to downgrade the optimization level.

Next, I plan to turn off optimization as much as possible to find the bottleneck. Does this make sense?

It’s the other way around. The compiler uses a collection of carefully tuned heuristics to decide how to trade off register usage for performance. -maxrregcount throws a wrench (or a spanner, if you are British) into the works by imposing a hard limit on register use. Hilarity ensues.

-maxrregcount was invented as a “quick fix” to address register pressure issues in the early days of CUDA, when GPU architectures were starved for registers. I last used it maybe ten years ago, and I would strongly recommend against using it. __launch_bounds__ was later introduced as a somewhat more sophisticated and finer-grained mechanism to impose register use limits. At this point, I would not recommend using that either.

Programmer attempts at pushing down register usage usually result in lower performance, in my observation. There may be isolated cases where it leads to success (better performance), but generally it just mucks up the compiler heuristics. FWIW, I am not clear why pushing down register usage is considered imperative here. Why not use the full 255 registers? At the hardware level, register allocation typically has a granularity greater than 1 anyhow, so pushing down the register usage reported by the compiler may not make any difference once that number is rounded up to the next multiple of N registers by the hardware (you can play with this in the CUDA occupancy calculator).

Higher occupancy is not strongly correlated with higher performance.

My motivation is to be able to use more threads. Originally, I could only run up to “nthreads * nblocks = 512 * 256”.

After some experimenting, I can confirm that brutally setting maxrregcount lower to allow more threads does not produce an appreciable gain in overall performance.

I still hope to reduce the register usage, though. So far, the only success has come from calling the device function directly instead of through function pointers.

By using maxrregcount, what I really hoped to do was debug the real cause of the high register usage (aside from optimization). But maybe setting -Xptxas -O1 is the better choice?

I really appreciate your explanation on the maxrregcount and launch bounds, and their history!

Generally speaking, a good rule of thumb is to use between 128 and 256 threads per thread block (in increments of 32 threads), unless there is a compelling reason to choose other block dimensions.

GPU architectures supported by current CUDA versions provide either 32K or 64K registers per thread block, so choosing 128 threads per thread block seems to be the way to go here, in that it allows the full set of 255 registers per thread across all supported architectures (>= CC 3.5). You can play with the occupancy calculator, but I don’t see how using 245 versus 255 registers per thread is going to make a difference. Experiments with 64 threads per thread block may be of interest on architectures that provide only 32K registers per thread block, as it is typically a good idea to have more than one thread block resident per SM.
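The arithmetic above can be sketched as follows (the register file size and the 255-register cap are assumptions for recent architectures, e.g. an A100):

```python
# Hypothetical occupancy arithmetic: registers available per thread
# when a given block size has one block resident per 64K-register SM.
REGS_PER_SM = 65536
MAX_REGS_PER_THREAD = 255  # architectural per-thread cap

def regs_available_per_thread(threads_per_block):
    """Per-thread register budget, capped at the architectural limit."""
    return min(REGS_PER_SM // threads_per_block, MAX_REGS_PER_THREAD)
```

At 128 threads per block the full 255-register cap is available; doubling the block size to 512 threads would require getting the kernel down to 128 registers per thread.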

Simply make the grid large enough to cover all data the kernel needs to process.

I was hoping to get it down to something like 128 registers (so that I can use double the threads). But maybe the kernel is too complicated in nature to achieve that…

All your explanations really helped me to understand the logic behind this. Thank you very much.

If the kernel currently compiles to code using 245 registers, getting that down to 128 registers is pretty much Mission Impossible. Even if you managed to do that, the steps taken to achieve that goal would likely drastically reduce the performance of the generated code.

Since each GPU register comprises 32 bits, every double variable requires two registers, so it is not unusual to see high register usage with double-precision computations. I would suggest spending some quality time with the CUDA profiler to let it pinpoint the bottlenecks in the code in its present form.
