Register usage: how good is the compiler?

Hello all,
I have quite a complex function and I want to reduce the number of registers it uses.

Without touching anything, it compiles to 42 registers. I have tried offloading some of the registers to shared memory, but I can’t get below 38 registers (and I have enough shared memory to offload 15 registers into it).

Of course, if I use maxrregcount I can reduce the number of registers used, but at the expense of local memory, and in this case performance is hurt rather than improved.
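For reference, the resource usage can be read straight off ptxas rather than out of the PTX. Compiling with something like nvcc -maxrregcount=32 --ptxas-options=-v -c mykernel.cu (the 32 is just an example cap, and mykernel.cu stands in for my source file) makes ptxas print the register count and the lmem bytes per kernel; whenever the lmem figure is above zero, that is the spilling I mean.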

The PTX is quite complex to follow because the function is so big, and before falling back to straining my eyes in there to see whether I can do better than the compiler, I would like to know if it is worth it.

So, in your experience, how well does the compiler do at keeping register usage to a minimum? Can it be helped along from the CUDA code? Is it worth using raw PTX in these cases?

Thanks.

Hi there,

I’m having the same trouble as you with register usage. I’m wondering how you offloaded registers to shared memory, as my strategy of declaring all thread-local variables as shared has had no effect on the register count. Looking at the PTX file, the compiler actually loads these shared variables into registers once and only uses the registers from there.
I don’t see any .local directives in the PTX file, so I have no clue which variables it ships off to local memory when I push it with -maxrregcount.

With some values of -maxrregcount I did see performance improve, although local memory gets used then.

Thanks.

Johannes

If you add the volatile keyword, does that force the compiler to keep the variable in shared memory? That is:

volatile __shared__ float x;

I believe it forces the compiler to fetch the value from shared memory into a register again each time it is accessed. But when using a value from shared memory you still need a register for the calculation; it is just that the register can be reused, even when the same shared memory location is accessed again later on.
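In other words, something like this (a minimal sketch; the kernel name, the parameters and the block size of 256 are placeholders):

__global__ void demo(const float *in, float *out)
{
    // volatile: every read of sdata goes back to shared memory instead
    // of being kept in a register across uses
    volatile __shared__ float sdata[256];   // assumes blockDim.x == 256

    sdata[threadIdx.x] = in[threadIdx.x];

    float a = sdata[threadIdx.x] * 2.0f;    // fetched into a register here
    float b = sdata[threadIdx.x] + 1.0f;    // fetched again; the register was free in between
    out[threadIdx.x] = a + b;
}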

Register offloading is done this way (each thread gets 15 consecutive floats in dynamic shared memory):

extern __shared__ float sfregs[];

// per-thread "register" slots, 15 floats per thread
#define SREG1    (sfregs[threadIdx.x*15])
#define SREG2    (sfregs[threadIdx.x*15 + 1])

...

__global__ void kernel( ... ) {
  SREG1 = 2.0f;
  SREG2 = SREG1 * 4.0f;
}

So I guess there is no need for extra registers, apart from the ones used to calculate the index in shared memory.
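One thing to keep in mind with extern __shared__: the size has to be supplied as the third launch parameter, so for the 15-floats-per-thread layout above the launch looks roughly like this (grid and block stand for whatever configuration you use):

kernel<<<grid, block, block.x * 15 * sizeof(float)>>>( ... );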

In fact, as far as I understand, if you just declare a variable as shared, it is indeed shared between all the threads in the block, so on its own that isn’t useful for register offloading; you need per-thread indexing like the macros above. Correct me if I’m wrong, please.

Going back to the topic, can anyone share his/her experience with reducing register usage?

Thanks.

I have only ever gotten a benefit with this technique when the shared variable truly is shared (i.e. constant across all threads in a block). Even then, the reduction in registers has been quite minimal as the compiler still loads the value into a register for performing computations with it. And the code complexity skyrockets when you need to update one of these variables because of all the __syncthreads().
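The pattern I mean looks roughly like this (a minimal sketch; the names and the one-factor-per-block layout are made up):

__global__ void scale(const float *in, float *out, const float *factors)
{
    __shared__ float factor;                 // one value for the whole block

    if (threadIdx.x == 0)
        factor = factors[blockIdx.x];        // a single thread writes it
    __syncthreads();                         // everyone waits before reading

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * factor;

    // updating factor mid-kernel needs a __syncthreads() before the write
    // (so nobody still reads the old value) and another one after it,
    // which is where the complexity comes from
}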

The times I have tried to offload one register per thread, my register usage has either stayed the same or gone up, and I have always seen performance decrease, so I don’t try this method any more. Perhaps, with a close examination of the PTX that the compiler produces along with some playing around with compiler options, one could get a benefit, but I’m skeptical.

I’m currently trying to offload variables to shared memory in order to have no local memory usage when compiling with -maxrregcount=20 or other values.

Another possible approach could be to hide common subexpressions in computations. That way the compiler reserves no register for a temporary result but computes it again every time. Of course this is only applicable if you have ALU cycles to spare.
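Roughly like this (a sketch: a, b, c, d and the gaps stand in for real code, and note that nvcc may merge the two a * b products back together unless the code in between prevents it):

__device__ float2 with_temp(float a, float b, float c, float d)
{
    float t = a * b;         // t stays live in a register across the gap
    // ... long stretch of unrelated computation ...
    return make_float2(t + c, t + d);
}

__device__ float2 recomputed(float a, float b, float c, float d)
{
    float x = a * b + c;     // no long-lived temporary ...
    // ... long stretch of unrelated computation ...
    float y = a * b + d;     // ... at the cost of redoing the multiply
    return make_float2(x, y);
}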