Register usage problem after static unroll (code generator)

Hello everybody.
I have run into a problem with static unrolling. To improve performance I wrote a code generator for my application. The generated code reads values from shared memory, applies basic operations to them (barely more than one operation per value read), and writes a final result back; this is repeated for a large number of iterations. Unrolling helps because it avoids recomputing memory addresses, which made up a significant portion of the original code. I know this is not ideal for CUDA, but the performance gain is worth it.

The problem is that the generated code uses a huge number of registers; the PTX file even shows about 3000. It seems the compiler finds some reuse of values and then keeps all of them alive in intermediate registers. Even if I specify -maxrregcount, the final cubin contains significant local memory usage, and that lowers performance quite a bit. What can I do to force the compiler to reuse registers? I tried rewriting the code to match the assembly and force it to use my registers, but nothing changed; it was even slower.
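Schematically, the effect seems to be something like this (not the real generated code, just an illustration of what I think is happening; the names here are made up):

[codebox]__global__ void illustration(unsigned int *out, const unsigned long long *in)
{
    __shared__ unsigned long long s_mem[2];
    if (threadIdx.x < 2) s_mem[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // A value that several unrolled steps consume gets loaded once...
    unsigned long long t0 = s_mem[0];
    out[threadIdx.x] ^= (unsigned int)(t0 >> 32);
    /* ...hundreds of other unrolled steps in between, t0 stays live... */
    // ...and is reused much later, so it occupies a register the whole time.
    out[threadIdx.x + 256] ^= (unsigned int)t0;
}[/codebox]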
I tried CUDA toolkit 2.1 and also 2.2.
Thanks in advance,
hlr

Look up the “volatile trick” in these forums; maybe it will help you.
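Roughly, the idea is to qualify the temporaries with volatile so the compiler does not hoist their loads and keep every intermediate value alive in its own register across the whole unrolled sequence. A minimal sketch (the names are made up, not your code):

[codebox]__global__ void volatile_sketch(unsigned int *dst, const unsigned int *src)
{
    // Without volatile the compiler may hoist both loads far up and keep the
    // values in registers; volatile makes it honor each access as written.
    volatile unsigned int a = src[threadIdx.x];
    volatile unsigned int b = src[threadIdx.x + 256];
    dst[threadIdx.x] ^= a ^ b;
}[/codebox]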

Can you post a snippet of your code that has been generated?

That is becoming a pretty popular and effective trick :) … it should be part of the programming guide

Thanks…

This would be akin to admitting that the NVCC / PTXAS optimizer is currently underperforming.

#define __register volatile

;-)

Christian

Thanks for the idea; at least I have now learned that this is normal behavior at the PTX assembly level. Unfortunately, adding volatile hasn’t helped yet.

Here is a code snippet with two of the generated steps. I am working on 64-bit values, which is why it looks a bit weird.

[codebox]// step 1: split the 64-bit word s_mem[0] into 32-bit halves, XOR with two
// pairs of threadLocal values, write the result back
temp=s_mem[0];
val.x=(temp>>32);
val.y=temp;
temp1.x=threadLocal[0];
temp1.y=threadLocal[0+89*16];
val.x^=temp1.x;
val.y^=temp1.y;
temp1.x=threadLocal[72];
temp1.y=threadLocal[72+89*16];
val.x^=temp1.x;
val.y^=temp1.y;
threadLocal[0]=val.x;
threadLocal[0+89*16]=val.y;

// step 2: same pattern for s_mem[1], with different threadLocal offsets
temp=s_mem[1];
val.x=(temp>>32);
val.y=temp;
temp1.x=threadLocal[16];
temp1.y=threadLocal[16+89*16];
val.x^=temp1.x;
val.y^=temp1.y;
temp1.x=threadLocal[88];
temp1.y=threadLocal[88+89*16];
val.x^=temp1.x;
val.y^=temp1.y;
threadLocal[16]=val.x;
threadLocal[16+89*16]=val.y;[/codebox]
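For context, the snippet lives inside a kernel whose relevant declarations look roughly like this (simplified, not copied from the real code):

[codebox]__global__ void unrolled_kernel(unsigned int *localBase)
{
    extern __shared__ unsigned long long s_mem[];  // 64-bit input words
    unsigned long long temp;                       // one word from shared memory
    uint2 val, temp1;                              // 32-bit halves / XOR operands
    unsigned int *threadLocal = localBase;         // precalculated per-thread pointer
    /* ...the generated, unrolled steps shown above... */
}[/codebox]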

Just a small note: if the compiler executed the code exactly as you wrote it, there would be a lot of read-after-write dependencies, and you need about 192 threads in a block to hide them (register read-after-write latency is on the order of 24 cycles on these GPUs, so you want roughly 6 warps of independent work in flight).

N.

Yes, there are 256 threads per block. I forgot to mention that threadLocal is a precalculated pointer into the local memory, with an offset based on the thread ID.
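The setup is roughly like this (the buffer name is a placeholder and the real offset computation is more involved):

[codebox]__global__ void example(unsigned int *localBuffer)
{
    // One precalculated pointer per thread; the generated code then indexes
    // relative to it instead of recomputing addresses in every step.
    unsigned int *threadLocal = localBuffer + threadIdx.x;
    /* the generated, unrolled steps access threadLocal[...] from here on */
}[/codebox]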