The programming guide (2.2.1, p. 17) talks about automatic loop unrolling (as well as unrolling requested with #pragma [see OpenMP on Wikipedia], which I am not asking about here). I am asking only about the automatic unrolling of loops that nvcc does without anything being specified in the code.
First, I don’t know what is meant by “small loops”. Is a loop that is just a one-liner (but has millions of iterations) considered small? Is a loop with just 10 iterations, but quite a number of lines of code, considered small?
Does anyone have examples where I can see this in action? Like a timing of the executable? (Is there a load meter for the GPU that also indicates how many threads are running, similar to the one for the CPU? I currently crudely use the GPU temperature gauge to see if it's under load.)
CUDA 2.2 for openSUSE 11.1 (64-bit), on a Quadro FX 3700
3.1.2 #pragma unroll
By default, the compiler unrolls small loops with a known trip count. The #pragma
unroll directive however can be used to control unrolling of any given loop. It
must be placed immediately before the loop and only applies to that loop. It is
optionally followed by a number that specifies how many times the loop must be unrolled.
For example, in this code sample:
#pragma unroll 5
for (int i = 0; i < n; ++i)
the loop will be unrolled 5 times. It is up to the programmer to make sure that
unrolling will not affect the correctness of the program (which it might, in the above
example, if n is smaller than 5).
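To make "unrolled 5 times" concrete, here is a hand-written sketch of what an unroll factor of 5 means conceptually. It is an illustration only, not actual nvcc output, and the names sum_rolled and sum_unrolled5 are made up for this post. The remainder loop at the end is what keeps the result correct when n is not a multiple of 5 (or is smaller than 5).

```cpp
// Reference version: the loop as it would be written in the source.
int sum_rolled(const int* a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

// Hand-unrolled by a factor of 5 (illustration only, not compiler output).
int sum_unrolled5(const int* a, int n) {
    int s = 0;
    int i = 0;
    // Main loop: five original iterations per pass, one branch per pass,
    // taken only while at least five elements remain.
    for (; i + 4 < n; i += 5) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
        s += a[i + 4];
    }
    // Remainder loop: handles the last n % 5 elements, or everything when n < 5.
    for (; i < n; ++i) {
        s += a[i];
    }
    return s;
}
```

Both functions return the same sum for any n >= 0; the unrolled one just executes fewer loop branches per element.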
#pragma unroll 1 will prevent the compiler from ever unrolling a loop.
If no number is specified after #pragma unroll, the loop is completely unrolled
if its trip count is constant, otherwise it is not unrolled at all.
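As a rough sketch of the constant versus non-constant trip count distinction described above (function names are hypothetical, and these are plain functions so the snippet also builds with an ordinary host compiler; in real CUDA code the loops would sit inside a __global__ or __device__ function):

```cpp
// Trip count is the compile-time constant 4, so per the guide text a bare
// "#pragma unroll" asks for complete unrolling. The guard only hides the
// pragma from compilers other than nvcc.
int dot4(const int* a, const int* b) {
    int s = 0;
#ifdef __CUDACC__
#pragma unroll
#endif
    for (int i = 0; i < 4; ++i) {
        s += a[i] * b[i];
    }
    return s;
}

// What complete unrolling of dot4 amounts to, written out by hand
// (illustration only, not actual compiler output).
int dot4_expanded(const int* a, const int* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Trip count n is only known at run time, so per the guide text a bare
// "#pragma unroll" would leave this loop not unrolled at all.
int dot_n(const int* a, const int* b, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

All three compute the same dot product when n is 4; the difference is only in whether the compiler can know the trip count ahead of time.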