automatic loop unrolling

The programming guide (2.2.1, p. 17) talks about automatic loop unrolling (as well as unrolling requested explicitly with #pragma [see OpenMP on Wikipedia], which I am not asking about here). I am asking only about the automatic unrolling of loops that nvcc does without being told to in the code.

First, I don’t know what is meant by “small loops”. Is a loop that is just a one-liner (but has millions of iterations) considered small? Is a loop with just 10 iterations, but quite a number of lines of code, considered small?

Does anyone have examples where I can see this in action, like a timing of the executable? (Is there a load meter for the GPU that also indicates how many threads are running, similar to the ones for the CPU?) I currently crudely use the GPU temperature gauge to see if it’s under load.


CUDA 2.2 for openSUSE 11.1 (64-bit), on a Quadro FX 3700

3.1.2 #pragma unroll
By default, the compiler unrolls small loops with a known trip count. The #pragma
unroll directive however can be used to control unrolling of any given loop. It
must be placed immediately before the loop and only applies to that loop. It is
optionally followed by a number that specifies how many times the loop must be
unrolled. For example, in this code sample:
#pragma unroll 5
for (int i = 0; i < n; ++i)
the loop will be unrolled 5 times. It is up to the programmer to make sure that
unrolling will not affect the correctness of the program (which it might, in the above
example, if n is smaller than 5).
#pragma unroll 1 will prevent the compiler from ever unrolling a loop.
If no number is specified after #pragma unroll, the loop is completely unrolled
if its trip count is constant, otherwise it is not unrolled at all.
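To make the placement concrete, here is a minimal sketch of a hypothetical kernel (names are made up) showing where the directive goes: immediately before the loop it applies to, inside device code. Since the trip count is a compile-time constant, the plain #pragma unroll form asks nvcc to unroll it completely:

```cuda
// Hypothetical device kernel: each thread sums 16 consecutive floats.
__global__ void rowsum16(float *out, const float *in)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll          // constant trip count (16), so fully unrolled
    for (int i = 0; i < 16; ++i)
        acc += in[tid * 16 + i];
    out[tid] = acc;
}
```

This requires nvcc and a GPU to run; it is only meant to show the pragma's placement relative to the loop.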

The best answer (though a bit frustrating) would be that you need to test it for your own code.

If you think of a simple for loop, it translates to a counter increment, an if statement to see if you should break out of the loop, a jump, and any other low-level operations needed to implement the loop. If all you do inside the loop is accumulate some value into a register, then the overhead of the loop code would be higher than the operation itself. That’s a good example of where unrolling would probably benefit you.

In my code, for example, I had a loop like this:

for ( int i = 0; i < 256; i++ )
    CalcSomething( .... );

where CalcSomething looks something like this:

if ( someCondition )
{
    pos = iBeginPos + threadIdx.x;
    float2 fValue1 = tex1Dfetch( tex, pos );
    float2 fValue2 = tex1Dfetch( tex, pos + 1 );

    // do some computations with fValue1 and fValue2.

    // Write back the results....
    fRes += fValue1.x + fTemp * ....;
}

I’ve seen ~20-30% speed up when changing the loop to this:

for ( int i = 0; i < 256; i += 16 )
{
    CalcSomething( iInput[ i ], ... );
    CalcSomething( iInput[ i + 1 ], ... );
    CalcSomething( iInput[ i + 2 ], ... );
    ...
    CalcSomething( iInput[ i + 15 ], ... );
}


Of course, remember to make sure that the unrolling will be valid (i.e. that 256 is a multiple of 16) - use padding if needed…

I think 16 was the optimal factor in my case, but again, this will likely be algorithm/code dependent.

As a rule of thumb I think both your samples can benefit from loop unrolling to some degree.

hope this helps


I have seen an example of loop unrolling in one of the SDK examples.

In the separable convolution sample, the convolution is executed either with a for loop or with a recursive call to a macro.

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
    sum += data[smemPos + k] * d_Kernel[KERNEL_RADIUS - k];

#define CONVOLUTION_ROW1(sum, data, smemPos) {sum = \
    data[smemPos - 1] * d_Kernel[2] + \
    data[smemPos + 0] * d_Kernel[1] + \
    data[smemPos + 1] * d_Kernel[0]; \
}


This is explained in p13 of the white paper by Victor Podlozhnyuk.

I don’t know if this answers your question, but that’s the only way I’ve used it so far.

#pragma unroll was used in our research project for an N-body kernel…

It provided a 15 to 20% speed increase for the single-precision implementation, and it also somewhat decreased register usage for the double-precision version.
The speed increase saturated as we increased the unroll factor; it seems there should be an optimal unroll factor for a particular sequence of flops.

This number is generally a multiple of 16 (a half-warp) … if you are using shared memory inside your loops.


It does not work.

It does not unroll small loops automatically and compile them to be sent to the GPU device.

Note, I am not talking about manually unrolling the loop, although manually specifying that the small 16-iteration loop should be unrolled also does not work.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    const int N = 16;
    const int M = 100000000;
    int i;
    int k;
    float j;
    time_t sec1;
    time_t sec2;
    time_t sec3;
    sec1 = time(NULL);
    printf("start \n");
    #pragma unroll
    for (i = 0; i < N; i++) {
        for (k = 0; k < M; k++) {
            j = rand();
        }
    }
    sec2 = time(NULL);
    sec3 = sec2 - sec1;
    printf("%ld seconds\n", (long)sec3);
    return 0;
}

To compile I simply use:

nvcc omp1.c

That is host code and is handled by your host compiler. Whether the #pragma unroll is honored is in the hands of whatever compiler was used; nvcc has nothing to do with that example.


nvcc is a compiler driver. When it runs, it parses the CUDA-specific extensions with the assistance of the standard C++ preprocessor, and then splits host and device code into two streams. The host code goes straight to the host compiler; the device code goes to NVIDIA’s port of the Open64 compiler. The #pragma directive you are complaining about is for device code, and your example contains no device code. If gcc or the MS C++ compiler doesn’t honour it, it will have no effect.

Compile with -O2 or -O3, which will turn on loop unrolling for your host code.