__device__ inline void foo(unsigned char* dst, const unsigned short* src, unsigned short scale)
{
    for (unsigned char i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)((src[i] * scale) >> 8);
}
With COUNT == 1 my whole kernel uses 12 registers. But with COUNT == 32 (which is the value I need), register usage goes up to 50.
With #pragma unroll 1, register usage only goes down to 47. Why so many registers, and how can I reduce the pressure? My kernel is currently limited by register usage.
You can experiment with the --maxrregcount=xx compiler switch, which forces the compiler to limit register usage to the value you give. The obvious drawback is that once the registers run out, values get spilled to local memory, which lives in off-chip device memory and is as slow as global memory.
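To make the switch concrete, here is a rough sketch of how one might try it. The file name scale.cu, the kernel name scale_kernel, the helper name scale_block, and the cap of 32 are all made up for illustration. -Xptxas -v asks ptxas to print the register count and any spill loads/stores, so you can see what the cap costs. The preprocessor guard only exists so the per-thread function also builds as plain host C++.

```cpp
// Assumed build line (file name and register cap are placeholders):
//   nvcc -Xptxas -v --maxrregcount=32 scale.cu
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#define COUNT 32

// Same scaling as foo() from the question; the product is widened to
// unsigned int so large inputs cannot overflow a signed int.
__host__ __device__ inline void scale_block(unsigned char* dst,
                                            const unsigned short* src,
                                            unsigned short scale)
{
    for (int i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)(((unsigned int)src[i] * scale) >> 8);
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread scales its own run of COUNT values.
__global__ void scale_kernel(unsigned char* dst, const unsigned short* src,
                             unsigned short scale, int n_runs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n_runs)
        scale_block(dst + t * COUNT, src + t * COUNT, scale);
}
#endif
```

If the ptxas output starts reporting spill stores and loads after you lower the cap, that is the local memory traffic mentioned above, and it is worth timing the kernel both ways.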
I was once able to fool the compiler into not unrolling a simple loop. If I remember right, I just reversed the order of the loop, which was enough to fool it.
for (i = MAXCOUNT; i--; ) { /* do stuff */ }
I remember I had to play with it to make sure it wasn’t detected as a loop, and CUDA 1.1 behaved differently from 2.0.
If that loop still gets unrolled, you could try fancier stuff, like making the compare a bitwise xor or something (the xor is nonzero until i reaches MAXCOUNT):
for (i = 0; (i ^ MAXCOUNT); i++) {}
Note also that any hack like this will be fragile as you compile with different versions of the SDK and such, but it may lead you to some “good enough for now” solution.
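For anyone who wants to try the two loop shapes, here is a small sketch that checks they visit the same elements. The function names and the summing body are made up for illustration. Whether either form actually stops the compiler from unrolling depends on the toolkit version, so look at the generated PTX before relying on it. The guard at the top only lets the same code build as plain host C++.

```cpp
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#define MAXCOUNT 32

// Counts down: visits i = MAXCOUNT-1, ..., 1, 0.
__host__ __device__ inline int sum_reversed(const int* v)
{
    int sum = 0;
    for (int i = MAXCOUNT; i-- > 0; )
        sum += v[i];
    return sum;
}

// Counts up, but the exit test is an xor:
// (i ^ MAXCOUNT) becomes 0 only when i == MAXCOUNT.
__host__ __device__ inline int sum_xor(const int* v)
{
    int sum = 0;
    for (int i = 0; (i ^ MAXCOUNT) != 0; i++)
        sum += v[i];
    return sum;
}
```

Both functions touch exactly MAXCOUNT elements, so they are drop-in replacements for the plain counting loop as far as results go.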
Speaking of which: is there an advantage to using one int as 2 shorts or 4 chars? Or does that just cause more temporary registers to be used for the computation? For example:
[codebox]
int xy = 0;                            // x in the low 8 bits, y in the bits above
while ((xy >> 8) < 100) {              // y
    while ((xy & 255) < 100) {         // x
        image[xy & 255][xy >> 8] = xy;
        xy++;
    }
    xy = (xy & ~255) + 256;            // reset x to 0 and advance y
}
[/codebox]
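Here is the packed-counter idea as a self-contained sketch, in case someone wants to compare its register count against a version with two separate counters. The names fill_packed, WIDTH and HEIGHT are made up, and the 100x100 size is taken from the snippet above. Whether packing saves registers is something to measure with -Xptxas -v: the shifts and masks need temporaries of their own, so it may well come out even or worse.

```cpp
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#define WIDTH  100
#define HEIGHT 100

// One int holds both coordinates: x in the low 8 bits, y in the bits above.
__host__ __device__ inline void fill_packed(int image[WIDTH][HEIGHT])
{
    // Outer step clears the x bits and adds 256, i.e. x = 0 and y = y + 1.
    for (int xy = 0; (xy >> 8) < HEIGHT; xy = (xy & ~255) + 256) {
        for (; (xy & 255) < WIDTH; xy++)
            image[xy & 255][xy >> 8] = xy;
    }
}
```

Each element image[x][y] ends up holding x + 256 * y, which makes it easy to check that every cell was visited exactly once.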
Whether a loop gets unrolled can be controlled with a directive placed immediately before the loop, as follows (from the programming guide):
#pragma unroll
For example, in this code sample:
#pragma unroll 5
for (int i = 0; i < n; ++i)
the loop will be unrolled 5 times. It is up to the programmer to make sure that
unrolling will not affect the correctness of the program (which it might, in the above
example, if n is smaller than 5).
#pragma unroll 1 will prevent the compiler from ever unrolling a loop.
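Applied to the function from the original question, the directive would look roughly like this. The names foo_rolled and foo_unroll4 and the factor 4 are made up for illustration. A host compiler that does not know the pragma will normally just ignore it (possibly with a warning), which is why the sketch also builds as plain C++. The product is widened to unsigned int so large inputs cannot overflow.

```cpp
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#define COUNT 32

// Ask the compiler to keep the loop rolled.
__host__ __device__ inline void foo_rolled(unsigned char* dst,
                                           const unsigned short* src,
                                           unsigned short scale)
{
    #pragma unroll 1
    for (int i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)(((unsigned int)src[i] * scale) >> 8);
}

// Ask for a partial unroll by 4; COUNT is a multiple of 4,
// so no leftover iterations are needed.
__host__ __device__ inline void foo_unroll4(unsigned char* dst,
                                            const unsigned short* src,
                                            unsigned short scale)
{
    #pragma unroll 4
    for (int i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)(((unsigned int)src[i] * scale) >> 8);
}
```

Both variants compute the same result; only the generated code, and with it the register count, should differ, so compare the two with -Xptxas -v.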