Register Allocation Woes

Hey. I have a kernel which has the following structure:

__global__ void my_lil_kernel(params)
{
    // do simple global-to-global memory operation

    __syncthreads();

    function1(params);
    function2(params);
}

__device__ void function1(params)
{
    int i;
    for (i = 0; i < 8; i++)
    {
        // Do stuff
        __syncthreads();
    }
}

__device__ void function2(params)
{
    // Do stuff
}

When the calls to function 1 and function 2 are both commented out, the kernel uses 6 registers. When function 1 is called but the call to function 2 is commented out, the kernel uses 12 registers (I'm confident I can shrink this down to 10…again…). When the call to function 1 is commented out and function 2 is called, the kernel uses 6 registers. However, when both function 1 and function 2 are called, the kernel uses 14 registers, which is very strange.
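
For anyone who wants to reproduce these numbers: the per-kernel register counts come from the ptxas verbose output. A typical invocation, assuming the source file is named my_lil_kernel.cu (the count below is illustrative, not one of the numbers above):

nvcc -c -Xptxas -v my_lil_kernel.cu

ptxas then prints one summary line per kernel, of the form:

ptxas info : Used 14 registers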

I would have thought that after function 1 returns, all the registers it used could automatically be re-used, since every thread in the block is guaranteed to be finished with function 1 by the time function 2 is called. Are there any coding practices or compiler directives I could use to rein in this strange register allocation behavior? I would prefer not to use local memory, in order to get 100% occupancy.

Thanks!

Well, device functions are inlined, so the register count you see is for the fully merged code: commenting out function2 may mean that part of function1 is optimized away as well.
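
If you want to test this, one experiment (a sketch, not from this thread; params_t is a placeholder for the original parameter list) is to keep function1 out of line and cap the kernel's register budget. Note that __noinline__ is only honored on hardware with ABI call support (sm_2x and later):

struct params_t { int dummy; };  // placeholder for the real parameters

// Keep function1 a real call instead of merging it into the kernel.
__device__ __noinline__ void function1(params_t p)
{
    for (int i = 0; i < 8; i++)
    {
        // Do stuff
        __syncthreads();
    }
}

__device__ void function2(params_t p)
{
    // Do stuff
}

// __launch_bounds__(256) tells ptxas to keep the register count low
// enough that 256-thread blocks can launch, spilling to local memory
// if it has to.
__global__ void __launch_bounds__(256) my_lil_kernel(params_t p)
{
    __syncthreads();
    function1(p);
    function2(p);
}

The per-file equivalent of __launch_bounds__ is nvcc's -maxrregcount=N flag. Both trade register pressure for possible local-memory spills, so keep an eye on the lmem numbers in the ptxas output.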
Also, it is not necessary to program for 100% occupancy; only when your kernel is not memory-bandwidth bound do you need to try for more occupancy.

I have even seen code where additional, more complex structures prevented some optimizations. That saved registers, and the code ran faster because of the higher occupancy! That was of course only because the memory latency was better hidden.
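
And rather than guessing at occupancy, you can ask the runtime what a given block size actually achieves. A minimal sketch, assuming a toolkit new enough to have the occupancy API (CUDA 6.5 or later) and the same placeholder params_t:

#include <cstdio>
#include <cuda_runtime.h>

struct params_t { int dummy; };  // placeholder for the real parameters

__global__ void my_lil_kernel(params_t p)
{
    // kernel body elided
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int numBlocks = 0;
    const int blockSize = 256;  // assumed block size; use your real one

    // Ask the runtime how many blocks of this size fit per SM, given
    // the kernel's actual register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, my_lil_kernel, blockSize, 0 /* dynamic smem */);

    float occupancy = (float)(numBlocks * blockSize)
                    / (float)prop.maxThreadsPerMultiProcessor;
    printf("theoretical occupancy at %d threads/block: %.0f%%\n",
           blockSize, occupancy * 100.0f);
    return 0;
}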

By memory bound, do you mean the code uses 100% of the memory bandwidth, or that it does not use the full bandwidth but saturates the arithmetic capabilities?

Johannes

Unfortunately, my code involves quite a few global memory reads and writes. I use shared memory for the data that is small enough to fit and gets re-used, but for the rest, that isn't a realistic option. Even though my kernel runs in 1.4 milliseconds, that still hurts when you call it a billion times :P So I'm really trying to squeeze every last bit of performance out of it.

Anyway, I’m just surprised that compiler optimizations would cross __syncthreads() like that. I guess that’s the important lesson I learned here, thanks!

66% occupancy is a good target for memory-bound kernels. In my experience, pushing past 66% increases performance by less than 1%.