reducing register usage won't necessarily increase performance. The compiler makes its decisions in an attempt to maximize performance.
In order to avoid reloads of data, the compiler would need at least 18 registers: in the refactoring below, 9 double values are live at once, and each double occupies two 32-bit registers.
The biggest single factor in the difference between 18 and 40 registers is the compiler loading data earlier than it is needed. This is a typical optimization: issuing loads early hides their latency, but it requires extra registers to hold the data until it is used.
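To illustrate the trade-off (this is a hypothetical sketch, not anything the compiler emitted; the function and names are made up):

__device__ double sketch(const double *a, const double *b, int n)
{
  double bval = b[0];   // early load: issued long before use, which hides
                        // the load latency but keeps a register pair live
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i];        // unrelated work; bval stays resident throughout
  // a late-load schedule would read b[0] here instead, freeing those
  // registers during the loop at the cost of stalling on the load
  return sum + bval;
}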
I guess the first thing I would do is to use launch bounds to gradually dial down the register usage limit, and see where the spill loads start. If the compiler is smart, it should be able to reorder its loads and "give you back" at least 12 registers, by switching from an early load to a late load, without converting to register spills.
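By launch bounds I mean the __launch_bounds__ qualifier, shown here applied to the kernel from my test case below (the 256 and 8 are placeholder values; on sm_70 they happen to work out to a cap of about 32 registers per thread, since 65536 registers per SM / (256*8) threads = 32):

__global__ void __launch_bounds__(256, 8)
k(const double *tau1, const double *tau2, const double *norm,
  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}

Asking for at least 8 resident blocks of 256 threads per SM forces ptxas to cap registers per thread accordingly. The -maxrregcount compile flag, which I use below, is the blunter per-file equivalent.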
Here is an example that I used for study:
$ cat t2036.cu
__host__ __device__
static inline void
rot_to_global(const double *tau1, const double *tau2, const double *norm,
              const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  // load each input exactly once; at most 9 temp doubles (18 registers) are live
  double temp1 = norm[0];
  double temp2 = tau1[0];
  double temp3 = tau2[0];
  double temp4 = qlocal[0];
  double temp5 = qlocal[1];
  double temp6 = qlocal[2];
  double temp7 = qlocal[3];
  double temp8 = qlocal[4];
  double temp9 = qlocal[5];
  qglobal[0] = temp4*temp1 + temp5*temp2 + tem6*temp3;
  qglobal[3] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  // reuse temp1-temp3 for the next column so the live set never grows
  temp1 = norm[1];
  temp2 = tau1[1];
  temp3 = tau2[1];
  qglobal[1] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[4] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  temp1 = norm[2];
  temp2 = tau1[2];
  temp3 = tau2[2];
  qglobal[2] = temp4*temp1 + temp5*temp2 + temp6*temp3;
  qglobal[5] = temp7*temp1 + temp8*temp2 + temp9*temp3;
  qglobal[6] = qlocal[6];
  qglobal[7] = qlocal[7];
}

__global__ void k(const double *tau1, const double *tau2, const double *norm,
                  const double *__restrict__ qlocal, double *__restrict__ qglobal)
{
  rot_to_global(tau1, tau2, norm, qlocal, qglobal);
}
$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info : Function properties for _Z1kPKdS0_S0_S0_Pd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 392 bytes cmem[0]
$ nvcc -c t2036.cu -Xptxas=-v -arch=sm_70 -maxrregcount 32
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z1kPKdS0_S0_S0_Pd' for 'sm_70'
ptxas info : Function properties for _Z1kPKdS0_S0_S0_Pd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 392 bytes cmem[0]
$
CUDA 11.4
According to my test case, the compiler is willing to give me back 8 registers without introducing spills. The above refactoring of your code has no effect on register usage by itself; I did it to make clear what I thought was the minimum number of registers needed to avoid unnecessary reloading of data.
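If you want to find exactly where the spills start, you can sweep the cap downward and watch the spill stores/spill loads lines (outputs omitted here; 18 is my estimated floor from the refactoring):

$ for r in 36 32 28 24 20 18; do echo "cap=$r"; nvcc -c t2036.cu -Xptxas=-v -arch=sm_70 -maxrregcount $r; done

The first cap at which the "0 bytes spill stores, 0 bytes spill loads" line changes tells you how many registers the compiler can give back for free.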
To be very clear, I'm not suggesting this is a good idea. It may make your code run slower.