#pragma unroll?

garciav · March 18, 2008, 5:13pm

Hi all,
I’ve seen #pragma unroll X used to unroll loops. Does it actually change anything? If so, what is a “good” value for X?
Thanks,
Vince

DenisR · March 18, 2008, 6:05pm

A good value is the minimum number of iterations that the for-loop will execute.

#pragma unroll 11
for(int k=0;k<10;k++) {
    stuff
}

will produce code that does not do what you expect.

So

#pragma unroll 11
for(int k=0;k<numloops;k++) {
    stuff
}

will be fine if numloops >= 11.

seibert · March 19, 2008, 1:13am

It’s also worth keeping in mind that the compiler will automatically unroll any small loop if it can figure out how many iterations it will have at compile time. This can either be a loop limit that is a constant, or a template parameter. In these cases, you don’t even need to specify #pragma unroll.
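As an illustration of the template-parameter case, here is a minimal sketch. The function name `dot_fixed` and its arguments are hypothetical; in a CUDA kernel it would presumably be marked `__device__`, which is left out here so the snippet compiles as plain C++.

```cpp
// Hypothetical helper: N is a template parameter, so the trip count of the
// loop is a compile-time constant and the compiler is free to unroll it on
// its own, without any #pragma unroll.
template <int N>
float dot_fixed(const float* a, const float* b) {
    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

Calling `dot_fixed<4>(a, b)` instantiates a version whose loop always runs exactly 4 times. Whether the compiler actually unrolls it still depends on its own heuristics.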

Sarnath · March 19, 2008, 10:54am

It works like this:

A branch instruction at the end of a FOR loop does NOT do any useful work.

Let’s say your FOR loop iterates 1000 times. Then you have 1000 branch instructions that are wasted. Let’s complicate it further by saying that your FOR loop body executes only 3 instructions. Then 25% of your FOR loop time (1 instruction out of 4) is spent on branching alone… And if this FOR loop runs for many iterations, then you need to unroll the loop to make it efficient.

However, let’s say the FOR loop body executes 99 instructions. Then only 1% of your FOR loop time is wasted in branching. If your FOR loop runs for, say, 100 milliseconds, then you can save 1 ms by completely unrolling it. But that’s NOT a lot compared to the code expansion you are going to witness. In such cases, one should NOT unroll.

You can work out the MATH for the FOR loop iterations and the for-loop body you have…

Note that you ought to be careful with the value of “unroll”.

Let’s say you have a loop like this:

#pragma unroll 5
for (i=0; i<n; i++)

Say that at run-time “n” has a value that is NOT divisible by 5; THEN your CODE’s CORRECTNESS will break…

Be careful while using it…

aka NaatuMokka…

Sarnath · March 19, 2008, 11:30am

Not quite, my dear friend from the Land of Dykes,

numloops has to be DIVISIBLE by 11. Otherwise, expect wrong results.

Man from the peninsula.

garciav · March 19, 2008, 3:33pm

Must the #pragma unroll directive be inserted just before the loop, or at the beginning of the program?
Is it a preprocessor directive? Is the unroll factor defined for the whole program?

Sarnath · March 19, 2008, 3:39pm

UNROLL has to be immediately above the FOR loop. The programming guide has a section on it. Just search for “unroll”.

DenisR · March 19, 2008, 4:54pm

Are you sure? Then I have to recheck my code when I change my blocksize…

Sarnath · March 20, 2008, 6:55am

The manual says so. Kindly refer to the “Unroll” section in the CUDA manual.

DenisR · March 20, 2008, 7:35am

By default, the compiler unrolls small loops with a known trip count. The #pragma unroll directive however can be used to control unrolling of any given loop. It must be placed immediately before the loop and only applies to that loop. It is optionally followed by a number that specifies how many times the loop must be unrolled.

For example, in this code sample:

#pragma unroll 5
for (int i = 0; i < n; ++i)

the loop will be unrolled 5 times. It is up to the programmer to make sure that unrolling will not affect the correctness of the program (which it might, in the above example, if n is smaller than 5).

#pragma unroll 1 will prevent the compiler from ever unrolling a loop.

If no number is specified after #pragma unroll, the loop is completely unrolled if its trip count is constant, otherwise it is not unrolled at all.
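The three forms described in the manual text above might be written as follows. This is only a sketch: the function names are hypothetical, the loops are shown in host-compilable C++ rather than inside a kernel, and a host compiler that does not recognise `#pragma unroll` may simply ignore it (possibly with a warning).

```cpp
// Request an unroll factor of 4 for a loop with a run-time trip count.
void scale(float* data, int n, float factor) {
    #pragma unroll 4
    for (int i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// "#pragma unroll 1" asks the compiler not to unroll this loop.
void offset(float* data, int n, float delta) {
    #pragma unroll 1
    for (int i = 0; i < n; ++i) {
        data[i] += delta;
    }
}

// No number given: the trip count (8) is constant, so per the manual text
// the loop is a candidate for complete unrolling.
float sum8(const float* data) {
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        s += data[i];
    }
    return s;
}
```

In each case the pragma sits immediately before the loop it applies to and affects only that loop.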

So, n does not need to be a multiple of the unroll factor; it just needs to be at least as large.

Sarnath · March 20, 2008, 8:45am

Hmm… That’s just an example to illustrate the point.

Why I say this is:

Let’s say you unroll an “n” iteration loop “m” times, for any arbitrary “m”.

Now, after unrolling the loop, the compiler would have to place a “CMP” instruction before each unrolled copy of the body to make sure that the unrolled code is NOT overshooting.

These CMP instructions would remove any performance advantage that UNROLL brings in.
If the compiler did indeed generate “CMP” instructions, then the example quoted above should work when “n” is < 5.

Wouldn’t it?

DenisR · March 20, 2008, 11:49am

#pragma unroll M
for (k=0 ; k< N ; k++)

delivers working code for M<=N. That is the only point I am trying to make. That is exactly what is written in the manual. N does not have to be a multiple of M, i.e. N = n*M with n = 1,2,3,4,5,… (that will also generate working code, but it is not necessary).

What happens as far as I understand is:

#pragma unroll 3
for (k=0;k<5;k++)
…

will be turned into

k=0;
…
k=1;
…
k=2;
…

for(k = 3;k<5;k++)
…

Sarnath · March 20, 2008, 5:22pm

Denis,

You are right. I was wrong. I just verified it by writing a FOR loop and examining the PTX.

The compiler first divides the run-time “n” value by “m” (the unroll factor) and then runs a FOR loop “N/M” times, in which the FOR loop body is expanded “M” times…

And, then it runs a FOR loop for “N%M” times…
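Written out by hand, the structure described here (a main loop of N/M trips with the body expanded M times, then a remainder loop of N%M trips) could look roughly like the sketch below for M = 4. The function name `sum_unrolled_by_4` is hypothetical, and this is a source-level analogy, not the PTX the compiler actually emits.

```cpp
#include <vector>

// Hand-written analogue of unrolling a summation loop by M = 4.
int sum_unrolled_by_4(const std::vector<int>& v) {
    const int n = static_cast<int>(v.size());
    const int M = 4;  // unroll factor
    int sum = 0;
    int k = 0;

    // Main loop: runs n / M times, with the body expanded M times.
    for (int block = 0; block < n / M; ++block) {
        sum += v[k];
        sum += v[k + 1];
        sum += v[k + 2];
        sum += v[k + 3];
        k += M;
    }

    // Remainder loop: runs n % M times and handles the leftover elements.
    for (int r = 0; r < n % M; ++r) {
        sum += v[k];
        ++k;
    }
    return sum;
}
```

Because of the remainder loop, the result is correct for any n, including values smaller than M or not divisible by M.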

I apologize for misdirecting people over here.

Denis was right.

Best Regards,
Sarnath

DenisR · March 20, 2008, 6:58pm

Well, that is interesting to know. I had always thought it only unrolled M times, and then performed a for loop N-M times. What you are describing is much smarter of the compiler. That is very useful when your minimal value of N is much smaller than the maximal value.

Hats off to you, btw, for being able to follow PTX code. I never got further than the second line in my kernel (tid = threadIdx.x + (blockIdx.x * blockDim.x)).

garciav · March 20, 2008, 7:07pm

Just a question: how much time can we expect to gain by using #pragma unroll?
Has anybody gained a lot?

Sarnath · March 21, 2008, 8:25am

I once got around 1.3x faster, I guess… It all depends on how time-consuming the FOR loop is and how big the body of the FOR loop is.

I think I wrote about this in a post sometime recently…

OK, first of all, for a FOR loop there are M instructions that form the body of the FOR loop and 2 instructions (CMP and BRANCH) that are redundant… They don’t contribute to the computation…

So, now find the ratio between the useless instructions and the FOR loop body… Now multiply this ratio by the total time taken by the FOR loop. That amount of time is a WASTE. You reduce this time by unrolling. You basically reduce this ratio by increasing the number of useful instructions…

So, you have to work out the math for your FOR loop and decide on what is best. Good Luck
