Hi,
Recently I started combining kernels in the hope of increasing the ratio of computation to global memory (gmem) access. However, it seems that the nvcc compiler is artificially bloating the register usage…
Originally I had:
[codebox]__global__ void func1_simpler(float *input1)
{
__shared__ float smem1[8][8];
smem1[tx][ty] = input1[index];
//do process 1 on smem1
float temp1 = smem1[tx][ty];
input1[index] = temp1;
}[/codebox]
It uses only 14 registers.
Then, after combining, I had:
[codebox]__global__ void func1(float *input1, float *input2, float *out)
{
__shared__ float smem1[8][8];
smem1[tx][ty] = input1[index];
//do process 1 on smem1
float temp1 = smem1[tx][ty];
smem1[tx][ty] = input2[index];
//do process 1 on smem1
float temp2 = smem1[tx][ty];
input1[index] = temp1*temp2;
}[/codebox]
This version needs 34 registers, so if I limit the kernel to 16 or 32 registers, it starts spilling to local memory (lmem)…
Unless I’m misunderstanding something, there should be no need for it to use so many more registers: temp1 is the only value that has to stay live while input2 is processed. I haven’t gotten my hands dirty with decuda yet, but process 1 is pretty complicated, so fixing it manually would probably be a nightmare.
Is there an easy way to fix this? Also, sorry if this has already been addressed; I did a quick search and didn’t find a relevant post…
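In case it helps anyone reproduce this, here is a self-contained sketch of the combined kernel. Several things in it are assumptions and not part of the real code: `process1_placeholder` is a hypothetical stand-in for process 1, the `tx`/`ty`/`index` mapping assumes one 8x8 thread block per tile with tiles stored back to back, the result is written to `out` instead of `input1`, and `run_func1_sketch` is just a made-up host helper. The `__launch_bounds__` qualifier only tells the compiler the maximum block size; whether it changes the register count here is untested.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 8

// Hypothetical stand-in for "process 1"; the real routine is far more complex.
// It reads another thread's element, so the caller must synchronize first.
__device__ float process1_placeholder(float (*smem)[TILE], int tx, int ty)
{
    return 0.5f * (smem[tx][ty] + smem[ty][tx]);
}

// Combined kernel. __launch_bounds__ declares that blocks never exceed
// TILE*TILE threads (an optional hint, not a guaranteed register reduction).
__global__ void __launch_bounds__(TILE * TILE)
func1_sketch(const float *input1, const float *input2, float *out)
{
    __shared__ float smem1[TILE][TILE];
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    // Assumed layout: block b owns elements [b*64, b*64 + 63].
    const int index = blockIdx.x * TILE * TILE + tx * TILE + ty;

    // Pass 1: only temp1 needs to survive this scope.
    float temp1;
    {
        smem1[tx][ty] = input1[index];
        __syncthreads();                  // tile fully loaded
        temp1 = process1_placeholder(smem1, tx, ty);
        __syncthreads();                  // all reads done before the tile is reused
    }

    // Pass 2: same shared tile, second input.
    float temp2;
    {
        smem1[tx][ty] = input2[index];
        __syncthreads();
        temp2 = process1_placeholder(smem1, tx, ty);
    }

    out[index] = temp1 * temp2;
}

// Hypothetical host helper: runs the kernel and returns the output,
// or an empty vector on bad sizes or any CUDA error.
std::vector<float> run_func1_sketch(const std::vector<float> &a,
                                    const std::vector<float> &b)
{
    const size_t n = a.size();
    if (n == 0 || n != b.size() || n % (TILE * TILE) != 0) return {};

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    std::vector<float> result(n);

    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const dim3 block(TILE, TILE);
        const dim3 grid((unsigned)(n / (TILE * TILE)));
        func1_sketch<<<grid, block>>>(d_a, d_b, d_out);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);    // cudaFree on a null pointer is a no-op
    cudaFree(d_b);
    cudaFree(d_out);
    if (!ok) result.clear();
    return result;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the per-kernel register and lmem usage, and `-maxrregcount=N` caps the register count for the whole file, so the numbers above can be compared directly against this sketch.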