Register usage problem

In my kernel I call the following code:

__device__ inline void foo(unsigned char* dst, const unsigned short* src, unsigned short scale)
{
    for (unsigned char i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)((src[i] * scale) >> 8);
}

With COUNT == 1 my whole kernel uses 12 registers. But with COUNT == 32 (which is the value I need) my register usage goes up to 50.

With #pragma unroll 1, the register usage only goes down to 47. Why so many registers, and how can I reduce the pressure? Currently my kernel is limited by register usage.

I am using CUDA 2.0.


Just a thought…but what happens if you declare the char outside of the for loop? Something like

[codebox]__device__ inline void foo(unsigned char* dst, const unsigned short* src, unsigned short scale)
{
    unsigned char i;

    for (i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)((src[i] * scale) >> 8);
}[/codebox]


Thanks for the suggestion. Unfortunately this doesn’t change anything.

I really need a solution to this problem. Does anyone else have an idea?


You can experiment with the --maxrregcount=xx compiler switch, which forces the compiler to cap register usage at the value you give. The obvious drawback is that once the registers run out, values are spilled to local memory, which physically resides in slow off-chip device memory.

I was once able to fool the compiler into not unrolling a simple loop. If I remember right, I just reversed the loop order, which was enough:

for (i = MAXCOUNT; i-- > 0; ) { /* do stuff */ }

I remember I had to play with it to make sure it wasn’t detected as a loop, and CUDA 1.1 behaved differently from 2.0.

If that loop still gets unrolled, you could try fancier stuff, like making the comparison a bitwise XOR or something:

for (i = 0; (i ^ MAXCOUNT) != 0; i++) {}

Note also that any hack like this will be fragile as you compile with different versions of the SDK and such, but it may lead you to some “good enough for now” solution.

What happens if you manually unroll the loop?

If you choose to, you might find Boost.Preprocessor interesting.
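For reference, manually unrolling the loop by four might look like the sketch below. The name foo_unrolled4 is hypothetical, COUNT is assumed to be a multiple of 4, and the __CUDACC__ guard is only there so the same code can also be checked on the host as plain C++. Whether this changes the register count at all depends on the compiler version, so it is worth comparing the numbers reported by nvcc --ptxas-options=-v rather than assuming.

```cpp
#ifndef __CUDACC__
#define __device__  // lets this sketch compile as plain C++ outside nvcc
#endif

#ifndef COUNT
#define COUNT 32    // trip count from the original post
#endif

// Hypothetical hand-unrolled variant of foo: four elements per iteration.
// Assumption: COUNT is a multiple of 4.
__device__ inline void foo_unrolled4(unsigned char* dst,
                                     const unsigned short* src,
                                     unsigned short scale)
{
    for (int i = 0; i < COUNT; i += 4) {
        dst[i + 0] = (unsigned char)((src[i + 0] * scale) >> 8);
        dst[i + 1] = (unsigned char)((src[i + 1] * scale) >> 8);
        dst[i + 2] = (unsigned char)((src[i + 2] * scale) >> 8);
        dst[i + 3] = (unsigned char)((src[i + 3] * scale) >> 8);
    }
}
```

Changing the unroll factor means editing the loop body and the stride together, which is the kind of repetition a preprocessor library can generate for you.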

What happens when i is an int, not a char? I don’t think you’re saving anything by having a char there, as it occupies a full 32-bit register anyway.
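The variant being asked about would look roughly like this. It is only a sketch: foo_int_index is a made-up name, and the __CUDACC__ guard exists purely so the code can be compiled as ordinary C++ for a quick host-side check. The idea is that a counter in the native 32-bit width may spare the compiler some narrowing or masking instructions, but I have not measured whether it affects the register count here.

```cpp
#ifndef __CUDACC__
#define __device__  // lets this sketch compile as plain C++ outside nvcc
#endif

#ifndef COUNT
#define COUNT 32    // trip count from the original post
#endif

// Hypothetical variant of foo with an int loop counter instead of unsigned char.
__device__ inline void foo_int_index(unsigned char* dst,
                                     const unsigned short* src,
                                     unsigned short scale)
{
    for (int i = 0; i < COUNT; i++)
        dst[i] = (unsigned char)((src[i] * scale) >> 8);
}
```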

Speaking of which: is there an advantage to using one int as 2 shorts or 4 chars? Or does this just cost more temporary registers for the computation? For example:

int xy = 0;
while ((xy >> 8) < 100) {        // y lives in the upper bits
    while ((xy & 255) < 100) {   // x lives in the low byte
        image[xy & 255][xy >> 8] = xy;
        xy++;                    // next x
    }
    xy = (xy & ~255) + 256;      // reset x to 0, advance y
}

Whether or not a loop is unrolled can be controlled with a directive, as follows (from the Programming Guide):

#pragma unroll

For example, in this code sample:

#pragma unroll 5
for (int i = 0; i < n; ++i)

the loop will be unrolled 5 times. It is up to the programmer to make sure that unrolling will not affect the correctness of the program (which it might, in the above example, if n is smaller than 5).

#pragma unroll 1 will prevent the compiler from ever unrolling a loop.
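Applied to a loop like the one in the original post, the directive goes immediately before the for statement. The sketch below is illustrative only: scale_to_bytes and its runtime length parameter n are hypothetical, and the #ifdef keeps the pragma away from host compilers so the function can be sanity-checked as plain C++.

```cpp
#ifndef __CUDACC__
#define __device__  // lets this sketch compile as plain C++ outside nvcc
#endif

// Hypothetical helper: same arithmetic as foo, with a runtime length n.
__device__ inline void scale_to_bytes(unsigned char* dst,
                                      const unsigned short* src,
                                      unsigned short scale,
                                      int n)
{
    // The pragma must sit directly before the loop it applies to.
#ifdef __CUDACC__
#pragma unroll 1  // ask the compiler not to unroll this loop
#endif
    for (int i = 0; i < n; ++i)
        dst[i] = (unsigned char)((src[i] * scale) >> 8);
}
```

As the first post shows, the directive does not by itself guarantee a low register count: usage there only dropped from 50 to 47.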