Why does using a break inside a loop save so many registers?

I’ve got a loop inside a kernel. If the condition check fails during the loop, I use break to quit the loop. Using break unexpectedly saves me 6 registers per thread. I don’t know why…

It will probably be the result of a compiler optimisation, although it is impossible to say more than that without seeing the code. The obvious candidates are dead code removal or some sort of result or function substitution, but that is pure speculation.

Break disables loop unrolling, thus fewer registers.

The relevant code is as follows. I’m trying to stop the loop if a negative is found. Actually the break is optional. The reason I want to add it is that as soon as a negative is found during the early iterations, we can stop and save time. But this version also shows the 6-register saving when I compile.

__shared__ unsigned char bv[16384];
__shared__ int negtive_flag[512];

for (l = 0; l < 6; l++) {
    if (!(bv[h[l] >> 3] & (1 << (7 - (h[l] & 0x7))))) {
        negtive_flag[threadIdx.x] = 1;
        break;
    }
}

But what if the break is never reached during the loop? I mean the condition check is never true through all the iterations. Then I guess it should use as many registers as the case without a break inside the loop. How does the compiler know, before the code actually runs, whether there is really a break taken that should disable the loop unrolling?

Yep, Sergey nailed it. Without the break statement, that loop has a small, known trip count and nvcc will automatically unroll it. With the break in place, the loop trip count is unknown and the loop unroll won’t happen. That is where the register savings are coming from.
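For instance, here is a minimal sketch of the two cases (the kernel names and data layout are made up purely for illustration); compiling each with --ptxas-options=-v should show the difference in reported register counts:

// Hypothetical kernels, just to contrast the two cases.
__global__ void sum_no_break(const int *in, int *out)
{
    int sum = 0;
    // Known trip count and no early exit: nvcc will typically unroll this.
    for (int l = 0; l < 6; l++)
        sum += in[threadIdx.x * 6 + l];
    out[threadIdx.x] = sum;
}

__global__ void sum_with_break(const int *in, int *out)
{
    int sum = 0;
    // The data-dependent break makes the trip count unknown,
    // so the loop stays rolled and fewer values are live at once.
    for (int l = 0; l < 6; l++) {
        int v = in[threadIdx.x * 6 + l];
        if (v < 0)
            break;
        sum += v;
    }
    out[threadIdx.x] = sum;
}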

On a side note, I still don’t understand why a break statement disables loop unrolling - wouldn’t it be trivial to implement with a jump out of the unrolled code?

Got ya~ So you mean the automatic unrolling of a loop with a known trip count is done by the optimizing compiler without us noticing. That’s why in this case it makes no difference whether we manually unroll or not in terms of register usage.

for (i = 0; i < 6; i++)
    A[i] = A[i] * 10;

seemingly uses the same number of registers as

A[0] = A[0] * 10;
A[1] = A[1] * 10;
A[2] = A[2] * 10;
A[3] = A[3] * 10;
A[4] = A[4] * 10;
A[5] = A[5] * 10;

because the loop has already been unrolled by the compiler. Is this a correct understanding?

Yep, that is pretty much it. You should be able to see the effect if you look at the compiled PTX from nvcc.
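If it helps, two ways to look at it (the file name here is just a placeholder): nvcc can write out the intermediate PTX, and ptxas in verbose mode prints the per-kernel register and shared memory usage at compile time:

nvcc -ptx kernel.cu -o kernel.ptx
nvcc -c --ptxas-options=-v kernel.cu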

Thanks avidday! But how can I see this difference quantitatively from nvcc?

Unrolling the loop actually increases register usage. What I want to see is the quantitative difference between the loop before unrolling and after unrolling.

You can disable loop unrolling by putting:

#pragma unroll 1

immediately before a loop you do not want to unroll. You can also do partial unrolling with the same pragma. See Appendix E.2 in the CUDA Programming Guide for details.
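For example, here is a sketch using the loop from the earlier post, with the break removed so the trip count stays at 6; compile it once with and once without the pragma and compare the register counts reported by --ptxas-options=-v:

#pragma unroll 1   // force the loop to stay rolled despite the known trip count
for (l = 0; l < 6; l++) {
    if (!(bv[h[l] >> 3] & (1 << (7 - (h[l] & 0x7)))))
        negtive_flag[threadIdx.x] = 1;   // no break here, so nvcc could otherwise unroll
}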