Forcing reduced register usage?

Hi,

Recently I started combining kernels in the hopes of increasing computation/gmem access. However it seems that the nvcc compiler is artificially bloating the register usage…

Originally I had:

[codebox]__global void func1_simpler(float *input1)

{

 __shared__ float smem1[8][8];

smem1[tx][ty] = input1[index];

//do process 1 on smem1

float temp1 = smem1[tx][ty];

input1[index] = temp1;

}[/codebox]

it uses only 14 registers.

Then combining, I had

[codebox]__global void func1(float *input1, float *input2, float *out)

{

 __shared__ float smem1[8][8];

smem1[tx][ty] = input1[index];

//do process 1 on smem1

float temp1 = smem1[tx][ty];

smem1[tx][ty] = input2[index];

//do process 1 on smem1

float temp2 = smem1[tx][ty];

input1[index] = temp1*temp2;

}[/codebox]

and needs 34 registers, so if I limit it to 16 or 32, it starts using lmem…

Unless I’m misunderstanding something, there is no need for it to use so many more registers? I haven’t gotten my hands dirty with decuda yet, but process1 is pretty complicated, so manually fixing it would probably be a nightmare.

Is there an easy way to fix this? Also sorry if this has already been addressed, I did a quick search and didn’t find a relevant post…

Check your variable initialization in second process 1.

unfortunately, there is. 20-24 cycles of register read-after-write latency. compiler tries to avoid it by using more registers than the necessary minimum. (or at least it’s what I think it thinks ;)). I’ve seen it several times in decuda.

pseudocode example (not correct actually - not taking into account memory latency, pointer registers etc.):

[codebox]a=in[0]

b=in[1]

c=a+1

d=b+1

out[0]=c

out[1]=d[/codebox]

so, minimal is 2 registers:

[codebox]r0=in[0]

r1=r0+1 //20-4=16 latency

out[0]=r1

r0=in[1]

r1=r0+1 //additional 16 cycles of latency

out[1]=r1[/codebox]

which is 2 registers, but 32 cycles of latency

the compiler will do something like:

[codebox]r0=in[0]

r1=in[1]

r2=r0+1 //20-2*4=12 cycles of latency

r3=r1+1 //20-3*4=8, but already covered by above!

out[0]=r2

out[1]=r3[/codebox]

which is 4 registers, but only 12 cycles of latency

I guess it works fine in compute-bound kernels, but really sucks in memory-bound ones. The catch is, the compiler doesn’t know what king of kernel it is compiling. (and I would really like to be able to turn it off anyway, just to check if I can do better).

I am reusing the same variables, without initializing them… But since when the code is initially converted into ptxas, all the registers are, in a sense, expanded until you are using hundreds, and then the compiler optimizes, so it wouldn’t matter (I think).

I saw some issues in relation to the latency stuff when searching on this topic, but it seemed inconclusive :P. Even considering that, when I add even a third one, the register usage increases even more… but the latency should only limit it to maybe at most 30…

Sounds like I have to do it manually D:

Check your code first. In most cases compiler uses minimal number of registers.

Also you need to check if you init your variables in all branches. If you reproduce the function, it may reuses variables and they could need registers to keep.