#pragma unroll?

Hi all,
I’ve seen #pragma unroll X used to unroll loops. Does it actually change anything? If yes, what is a “good” value for the unroll factor?

A good value is the minimum number of iterations that the for-loop will execute.

#pragma unroll 11

for(int k=0;k<10;k++) {



will produce code that does not do what you expect.


#pragma unroll 11

for(int k=0;k<numloops;k++) {



will be fine if numloops >= 11.

It’s also worth keeping in mind that the compiler will automatically unroll any small loop if it can figure out how many iterations it will have at compile time. This can either be a loop limit that is a constant, or a template parameter. In these cases, you don’t even need to specify #pragma unroll.
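To illustrate the template-parameter case, here is a minimal sketch in plain C++ (in a CUDA kernel the function would carry a `__device__` qualifier). The name `dot_fixed` and the dot-product body are made up for the example. Whether a given compiler really unrolls this loop depends on the compiler and the optimisation settings.

```cpp
// Hypothetical helper: N is a template parameter, so the trip count of the
// loop is a compile-time constant and the compiler is free to unroll it
// without any #pragma.
template <int N>
float dot_fixed(const float* a, const float* b) {
    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {  // N is known when the template is instantiated
        sum += a[k] * b[k];
    }
    return sum;
}
```

Calling `dot_fixed<4>(a, b)` instantiates a version of the function whose loop always runs exactly 4 times.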

It works like this:

The branch instruction at the end of a FOR loop does NOT do any useful work.

Let’s say your FOR loop iterates 1000 times. Then you have 1000 branch instructions that are wasted. Let’s complicate it further by saying that your FOR loop body executes only 3 instructions. Then 1 out of every 4 instructions is a branch, so 25% of your FOR loop time is spent on branching only… And if this FOR loop runs for many iterations, then you need to unroll the loop to make it efficient.

However, let’s say the FOR loop body executes 99 instructions. Then only 1% of your FOR loop time is wasted on branching. Again, if your FOR loop executes for, say, 100 milliseconds, then you can save 1 ms by completely unrolling it. But that’s NOT a lot when compared to the code expansion you are going to witness. In such cases, one should NOT unroll.

You can work out the MATH for the FOR loop iterations and the for-loop body you have…
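A rough sketch of that math in C++, assuming (as the post does) that every instruction costs the same and that the branch is the only overhead. The function names are invented for this example.

```cpp
// Fraction of executed instructions that are loop overhead rather than
// useful body work. Both counts are per iteration. Assumption of this
// sketch: every instruction takes the same amount of time.
double overhead_fraction(int body_instructions, int overhead_instructions) {
    return static_cast<double>(overhead_instructions) /
           (body_instructions + overhead_instructions);
}

// Upper bound on the time a full unroll could save: the share of the loop
// time that is spent on overhead instructions.
double max_time_saved_ms(double loop_time_ms, int body_instructions,
                         int overhead_instructions) {
    return loop_time_ms *
           overhead_fraction(body_instructions, overhead_instructions);
}
```

With a 3-instruction body and 1 branch, `overhead_fraction(3, 1)` gives 0.25. With a 99-instruction body it gives 0.01, so a loop that takes 100 ms can save at most about 1 ms.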

[b]Note that you ought to be careful with the value of “unroll”.

Let’s say you have a loop like this:

#pragma unroll 5

for (i=0; i<n; i++)

Say that at run time “n” has a value that is NOT divisible by 5: THEN your code’s CORRECTNESS will break…[/b]

Be careful while using it…

aka NaatuMokka…

Not Quite, My dear friend from the Land of Dykes,

numloops has to be DIVISIBLE by 11. Otherwise, expect wrong results.

Man from the peninsula.

Must the #pragma unroll directive be inserted just before the loop, or at the beginning of the program?
Is it a preprocessor directive? Is the unroll factor defined for the whole program?

UNROLL has to be immediately above the FOR loop. The programming guide has a section on it. Just search for “unroll”.

Are you sure? Then I have to recheck my code when I change my blocksize…

The manual says so. Kindly refer to the “Unroll” section in the CUDA manual.

So, it does not need to be a multiple, it just needs to be bigger.

Hmm… That’s just an example to illustrate the point.

Why I say this is:

Let’s say you unroll an “n”-iteration loop “m” times, for any arbitrary “m”.

Now, after unrolling the loop, the compiler would have to place a “CMP” instruction before each unrolled copy of the body to make sure that the unroll is NOT overshooting.

  1. This CMP statement will remove any performance advantage that UNROLL brings in.

  2. If the compiler indeed generated “CMP” statements then the example quoted above should work when “n” is < 5.

Isn’t it?

#pragma unroll M
for (k=0 ; k< N ; k++)

delivers working code for M<=N. That is the only point I am trying to make. That is exactly what is written in the manual. It is not only for N = n*M with n = 1,2,3,4,5,… (that will also generate working code, but it is not necessary).

What happens as far as I understand is:

#pragma unroll 3
for (k=0;k<5;k++)

will be turned into




for(k = 3;k<5;k++)


You are right. I was wrong. I just verified it by writing a FOR loop and examining the PTX.

The compiler first divides the run-time “N” value by “M” (the unroll factor) and then runs a FOR loop “N/M” times, in which the FOR loop body is expanded “M” times…

And then it runs a FOR loop “N%M” times for the remaining iterations…
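Written out by hand, that structure looks roughly like the sketch below, for M = 4. This is plain C++ rather than compiler output; the function name and the array-summing body are made up for illustration.

```cpp
// Manual equivalent of unrolling a loop by M = 4.
// Hypothetical example body: summing the first n elements of an array.
int sum_unrolled_by_4(const int* data, int n) {
    const int M = 4;
    int sum = 0;
    int k = 0;
    // Main loop: runs n / M times, with the body expanded M times, so there
    // is only one compare-and-branch per M elements.
    for (int block = 0; block < n / M; ++block) {
        sum += data[k];
        sum += data[k + 1];
        sum += data[k + 2];
        sum += data[k + 3];
        k += M;
    }
    // Remainder loop: handles the last n % M elements one at a time, so the
    // result is correct even when n is not divisible by M.
    for (; k < n; ++k) {
        sum += data[k];
    }
    return sum;
}
```

For n = 10 the main loop runs twice (8 elements) and the remainder loop runs twice. For n = 3 the main loop does not run at all and the remainder loop does everything.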

I apologize for misdirecting people over here.

Denis was right.

Best Regards,

Well, that is interesting to know. I had always thought it only unrolled M times and then performed a for loop N-M times. What you are describing is much smarter of the compiler. That is very useful when your minimal value of N is much smaller than the maximal value.

Hats off to you, btw, for being able to follow PTX code. I never got further than the second line in my kernel (tid = threadIdx.x + (blockIdx.x * blockDim.x))

It’s just a question, but how much speedup can we expect from using #pragma unroll?
Has anybody gained a lot?

I once got around 1.3x faster, I guess… It all depends on how time-consuming the FOR loop is and how big the body of the FOR loop is.

I think I wrote about this in a post sometime recently…

OK, first of all, for a FOR loop there are M instructions that form the body of the FOR loop and 2 instructions (CMP and BRANCH) that are redundant… They don’t contribute to the computation…

So, now find the ratio of the useless instructions to the total instructions executed per iteration (body plus overhead)… Now multiply this ratio by the total time taken by the FOR loop. That amount of time is a WASTE. You reduce this time by unrolling. You basically reduce this ratio by increasing the number of useful instructions per CMP/BRANCH pair…

So, you have to work out the math for your FOR loop and decide on what is best. Good luck.
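As a back-of-the-envelope sketch of how the unroll factor changes that ratio, assume 2 overhead instructions (CMP and BRANCH) per trip through the loop and equal cost for every instruction. The function name is invented for this example, and real hardware will not follow this model exactly.

```cpp
// Share of executed instructions that are loop overhead after unrolling.
// Assumptions of this sketch: 2 overhead instructions (CMP + BRANCH) per trip
// through the unrolled loop, and every instruction costs the same.
double overhead_after_unroll(int body_instructions, int unroll_factor) {
    const int loop_overhead = 2;  // CMP + BRANCH
    return static_cast<double>(loop_overhead) /
           (body_instructions * unroll_factor + loop_overhead);
}
```

For a 3-instruction body, `overhead_after_unroll(3, 1)` is 0.4, while unrolling by 4 brings it down to 2/14, about 0.14. For a large body the ratio is already small without unrolling, which is why unrolling buys little there.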