Tricks to minimise register usage!

Hi everyone,

As you all know, minimising register usage is important for having more threads executing simultaneously. It matters even more as the kernel algorithm becomes more complex.

I found some little tricks to minimise registers (without changing the algorithm), and would like everyone interested to participate and share the tricks they have found.

Thanks to some of these tricks I reduced the register count of my kernel (as reported in the .cubin file) from 29 to 19.

Here are some of mine:

First, the one with which I gained the most:

Operands should be assigned to temporaries before repeated operations: this prevents the compiler from choosing a different register each time when only two or three would have been enough.
(I don't know whether the register declaration itself helps, but the explicit operand assignments do.)
Example:


            register float opRF;
            register float opFilter;

            RFData = texfetch(texRFData_0, decal, yTex);
            opRF = RFData.x;
            opFilter = Fval.x;
            resX += opRF * opFilter;
            opRF = RFData.y;
            opFilter = Fval.y;
            resX += opRF * opFilter;
            opRF = RFData.z;
            opFilter = Fval.z;
            resX += opRF * opFilter;
            opRF = RFData.w;
            opFilter = Fval.w;
            resX += opRF * opFilter;

instead of:

            RFData = texfetch(texRFData_0, decal, yTex);

            resX += RFData.x * Fval.x;
            resX += RFData.y * Fval.y;
            resX += RFData.z * Fval.z;
            resX += RFData.w * Fval.w;

Second: minimise the number of arguments in the kernel call; turn them into global constants when possible.
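A minimal sketch of what I mean (the names c_filterLen, filterKernel and d_out are made up for illustration):

```cuda
// Sketch only: c_filterLen, filterKernel and d_out are illustrative names.
// Instead of passing filterLen as a kernel argument on every launch,
// keep it in constant memory:
__constant__ int c_filterLen;

__global__ void filterKernel(float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // the bound is read from constant memory, not from a kernel argument
    if (i < c_filterLen)
        d_out[i] *= 2.0f;
}

// Host side, once before launching:
//   cudaMemcpyToSymbol(c_filterLen, &filterLen, sizeof(int));
//   filterKernel<<<grid, block>>>(d_out);
```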

That’s all for now. Do you have some tricks to share?

Hi,

For me it helped a lot to make kernels less generic; in my case this was very specific to loop sizes.

First I had a loop that depended on an argument of the kernel. After looking at the algorithm a bit, one could see that the loop bound only took 3 different values; hard-coding these loops into three different functions yielded a reduction from 13 to 5 registers.

Original loop:

        act_size += Nk;
        while ( act_size < expandedKeySize )
        {
          //compute
          act_size += 4;
        }

Replacement loop:

        for ( i = 4; i < 176/4; i++ )
        {
          //Computation
        }

I guess this improvement was quite obvious…

Here is my tip: If your algorithm can run with a variable block size, try hardcoding that block size with a const int so that it can be determined at compile time instead of using blockDim. Sure, this means that you must recompile every time you want to test a new block size, but one of my kernels went from 26 to 19 registers used with this trick. Incidentally, the block size was being used in two loop conditions, so this tip may just be the same as esteve’s, but I mention it anyway.
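A sketch of the hard-coded block size (BLOCK_SIZE and the kernel are illustrative; the launch configuration must use the same value):

```cuda
// BLOCK_SIZE is a compile-time constant instead of reading blockDim.x
// (illustrative sketch; must match the launch configuration).
const int BLOCK_SIZE = 128;

__global__ void scaleKernel(float *d_data, int n)
{
    // index arithmetic and the loop condition involving BLOCK_SIZE can
    // be constant-folded by the compiler instead of living in registers
    for (int i = blockIdx.x * BLOCK_SIZE + threadIdx.x; i < n;
         i += gridDim.x * BLOCK_SIZE)
        d_data[i] *= 0.5f;
}

// launch: scaleKernel<<<grid, BLOCK_SIZE>>>(d_data, n);
```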

My tip is the following: remove loops if you can. I was able to go from 25 to below 10.

Note that you need at most 10 registers per thread to get 100 percent occupancy: with 8192 registers per multiprocessor shared among 768 threads (on G80), anything over floor(8192/768) = 10 registers makes full occupancy impossible.

Gordon

My tip is to use the “--maxrregcount” option to ptxas, as described on page 20 of the nvcc manual (beta-01). This allows almost complete control of the number of registers, although the number used seems to be 1 or 2 more than the number you specify (it may be that some uses are not regarded as ‘temporary’).

This does not necessarily imply better performance than the other options discussed here, which may reduce unnecessary register use.

My kernel fails to run with --maxrregcount :( That option just uses local memory in place of registers anyway.

My trick is to store stuff in shared memory (instead of in registers) using volatile pointers (without volatile, the compiler just “optimises” everything back into registers). This also prevents the compiler from doing unnecessary common subexpression elimination (since anything less than -O2 won’t compile at all).
Example:


Instead of something like:

unsigned int b0, b1, b2;  // values computed earlier, held in registers
xp[0] = texfetch(tree, p+2-(b0>>31u));
xp[1] = texfetch(tree, p+4-(b1>>31u));
xp[2] = texfetch(tree, p+6-(b2>>31u));

use:

extern __shared__ int sh[];
unsigned int volatile *b = (unsigned int volatile *)sh;
xp[0] = texfetch(tree, p+2-(b[0]>>31u));
xp[1] = texfetch(tree, p+4-(b[1]>>31u));
xp[2] = texfetch(tree, p+6-(b[2]>>31u));

This prevents the compiler from doing CSE on the b[?]>>31u expressions, and the b[?] accesses consume almost no registers at all.

Sorry for the silly question, but where can you find out how many registers you are using?

In the .cubin file: compile with nvcc -cubin instead of -o, as you would do for .ptx files, and then look for reg = in the entry for the kernel you want.

For algorithms that have a lot of state or for variables that are used across loops or code blocks, I had success in reducing the number of registers by forcing the variables to shared memory. This comes at the cost of some st.shared / ld.shared calls but allows better reuse of registers in the code segments between the variable accesses.
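A minimal sketch of this (one shared slot per thread; the size 128 and the kernel shape are illustrative assumptions):

```cuda
__global__ void stateKernel(float *d_data)
{
    // One shared-memory slot per thread holds a long-lived value,
    // freeing a register between the two accesses (at the cost of
    // st.shared / ld.shared instructions).
    __shared__ float s_state[128];           // assumes blockDim.x <= 128

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_state[threadIdx.x] = d_data[i];

    // ... register-hungry computation in between ...

    d_data[i] = s_state[threadIdx.x] * 2.0f;
}
```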

Peter

Hi there,

I got something nice which saved me some 4 registers (according to my .cubin file). I guess it applies in many cases, and hopefully the compiler will be able to recognise this in the future…

__device__ float getFilter( float3 f3R )
{
  float fSqrLen;  // squared length of vector
  float fRetVal;  // return value

  // compute squared length of vector
  fSqrLen = f3R.x * f3R.x + f3R.y * f3R.y + f3R.z * f3R.z;

  // compare squared vector length with two constants, where
  // c_fMaxSqrLen is located in constant memory
  if ((fSqrLen < 0.001) || (fSqrLen > c_fMaxSqrLen))
  {
    fRetVal = 0.0;
  }
  else
  {
    fRetVal = 1.0;
  }

  // return value
  return fRetVal;
}

It seems that the compiler allocates a lot of registers to test this if statement. My changed code looked like this:

__device__ float getFilter( float3 f3R )
{
  float fSqrLen;  // squared length of vector
  float fRetVal;  // return value
  float fT1, fT2; // temporary variables

  // compute squared length of vector
  fSqrLen = f3R.x * f3R.x + f3R.y * f3R.y + f3R.z * f3R.z;

  // ------------- CHANGED ----------------------------
  // preload constants
  fT1 = 0.001;
  fT2 = c_fMaxSqrLen;

  // compare squared vector length with two registers
  if ((fSqrLen < fT1) || (fSqrLen > fT2))
  {
    fRetVal = 0.0;
  }
  else
  {
    fRetVal = 1.0;
  }

  // return value
  return fRetVal;
}

This looks really trivial, but it saved 4 (!) registers!

Hope it helps some of you,

Jake

Hey, sorry to bump this old post, but can someone explain why loading the constant data into temporary registers lowers the register count? It seems more logical that it would increase it.

Thanks

Another thing I’m concerned about is how performance improves or decreases when we use fewer or more registers. I think it should be faster when we use registers instead of shared memory.

It depends on what you do. Fewer registers result in more concurrent threads. This may result in more latency being hidden, and may outweigh the sacrifice of putting things in shared memory. If the latency is already well hidden, using more registers may be the better idea.

One thing that reduces register usage for my carefully tuned kernels is to turn off optimizations with “nvcc --ptxas-options -O0”

Also, as a general rule, look very carefully at the .cubin and .ptx files. The compiler often does very dumb things, like moving things into local memory (TERRIBLY slow compared to shared). Other things include putting arrays that you’d like kept in registers into local memory instead. Also, it might insert extra address-calculation instructions where they aren’t needed.

Unrolling loops and using static addressing (i.e. a[10] instead of a[i]) can help a lot. The compiler seems to have little awareness of when it can unroll loops or avoid address calculations. The C preprocessor comes in very handy for this task (remember the ‘##’ concatenation operator!).
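A sketch of what I mean (dot4 and the accumulation are illustrative):

```cuda
// Unroll with the preprocessor so every access uses a static index
// (a[0], a[1], ...) instead of a[i] (illustrative sketch).
#define STEP(i)  acc += a[i] * b[i];

__device__ float dot4(const float *a, const float *b)
{
    float acc = 0.0f;
    STEP(0) STEP(1) STEP(2) STEP(3)   // no loop counter, no address calc
    return acc;
}
#undef STEP

// The '##' concatenation operator helps generate distinct names,
// e.g. several independent accumulators:
//   #define DECL(n) float acc##n = 0.0f;
//   DECL(0) DECL(1)   /* float acc0 = 0.0f; float acc1 = 0.0f; */
```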

Also, --maxrregcount seems like a bad idea because it again puts things into local memory instead of really affecting the optimisation algorithm.

Basically, just keep your eye on ptx/cubin files and then jiggle your code until it looks like you’re getting the right result. Unfortunately, that’s the best bet.

Actually, many of the above techniques will increase register usage. As someone said, decreasing registers is often not the ultimate goal. Let me make this clear: for math-intensive algorithms, you’ll get much better performance using your registers. For bandwidth-intensive algorithms, you’ll be better off decreasing register usage by going to shared memory. This is so more memory fetches can be running in parallel.