loop unrolling

I have the following code using loop unrolling:

#pragma unroll
for (int i = 0; i < n; i++)

Here, if n is a compile-time constant, everything works fine. However, if n is a variable, performance drops dramatically: I noticed that roughly 3 times as many instructions are issued and executed. Can anyone explain this? Alternative solutions are welcome as well.

With n constant, is it possible that the compiler pre-computes some or all of the loop's result? That sort of optimization would explain the performance difference.

What is the occupancy?

Could you list the register usage with “nvcc -Xptxas -v -arch=sm_20 [source file]”?

Usually, the compiler uses more registers when it unrolls a loop.
By the way, if n is a constant, the compiler unrolls the loop automatically (at least for small loops), even without the pragma.

My intention is to use loop unrolling to improve performance. I guess if it’s a compiler-level technique, then I won’t be able to change the iteration count at run time.

Occupancy is fine; I checked with the Visual Profiler. The number of instructions issued is about 3 times higher than in the unrolled case. I am just looking for a way to do loop unrolling at run time. Maybe that’s just not feasible.

Hi There, Small Potato!

The compiler unrolls even if “n” is not a compile-time constant,
but I think you need to give it an unroll factor.
Try “#pragma unroll 5”. The compiler will generate code that divides “n” by 5 and unrolls accordingly, with any leftover iterations handled separately.
You can check the PTX; that should tell you clearly what the compiler is doing.
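
To make the suggestion concrete, here is a minimal sketch of a kernel whose trip count n is only known at run time, with an explicit unroll factor. The kernel name rowSum, the host helper rowSumOnes, and the data layout are made up for illustration; the factor 5 is just the one suggested above. The pragma is a request to the compiler, so whether an unrolled main loop plus a remainder loop is really emitted should be confirmed in the PTX (for example with “nvcc -ptx”).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread sums the n consecutive floats of its row.
// n is a run-time value, so the trip count is unknown at compile time.
__global__ void rowSum(const float* in, float* out, int n, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    // Request an unroll factor of 5. The result is the same for any n;
    // only the generated code changes.
    #pragma unroll 5
    for (int i = 0; i < n; i++)
        acc += in[row * n + i];

    out[row] = acc;
}

// Hypothetical host helper: fills the input with 1.0f, runs the kernel,
// and returns the sum computed for row 0 (which should equal n).
// Error checking is omitted to keep the sketch short.
float rowSumOnes(int n, int rows)
{
    std::vector<float> hIn(static_cast<size_t>(n) * rows, 1.0f);
    std::vector<float> hOut(rows, 0.0f);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    rowSum<<<blocks, threads>>>(dIn, dOut, n, rows);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut[0];
}
```

Comparing the register counts reported by “nvcc -Xptxas -v” with and without the pragma shows what the unrolling costs.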

In my finite difference kernels for wave equations (very similar to FDTD3d from the SDK), unrolling increases performance.
However, manual loop unrolling produces even faster code: ptxas reports lower register usage when I unroll manually than with #pragma unroll.
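
For reference, manual unrolling of a loop with a run-time trip count could look like the sketch below: a main loop that handles four iterations per pass, followed by a remainder loop. The kernel rowSumManual, the helper rowSumManualRamp, and the factor 4 are illustrative assumptions, not the finite difference kernels mentioned above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: per-row sum with the loop unrolled by hand.
__global__ void rowSumManual(const float* in, float* out, int n, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float* p = in + row * n;
    float acc = 0.0f;
    int i = 0;

    // Main loop: four iterations per pass, added in the original order
    // so the floating-point result matches the rolled loop.
    for (; i + 3 < n; i += 4) {
        acc += p[i];
        acc += p[i + 1];
        acc += p[i + 2];
        acc += p[i + 3];
    }
    // Remainder loop: the last n % 4 iterations.
    for (; i < n; i++)
        acc += p[i];

    out[row] = acc;
}

// Hypothetical host helper: every row holds 1, 2, ..., n, so each row sum
// should be n * (n + 1) / 2. Returns the sum for row 0.
// Error checking is omitted to keep the sketch short.
float rowSumManualRamp(int n, int rows)
{
    std::vector<float> hIn(static_cast<size_t>(n) * rows);
    for (size_t k = 0; k < hIn.size(); k++)
        hIn[k] = static_cast<float>(k % n + 1);
    std::vector<float> hOut(rows, 0.0f);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    rowSumManual<<<blocks, threads>>>(dIn, dOut, n, rows);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut[0];
}
```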

Loop unrolling happens at compile time.
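
Because of that, one possible workaround (not something proposed earlier in this thread, just a common pattern) is to make the trip count a template parameter and pick the matching instantiation at run time. The sketch below assumes n usually takes one of a few known values; the names rowSumFixed, rowSumGeneric, launchRowSum, and rowSumDispatchOnes, and the values 4, 8, and 16, are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Trip count N is a compile-time constant, so the loop can be fully unrolled.
template <int N>
__global__ void rowSumFixed(const float* in, float* out, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; i++)
        acc += in[row * N + i];
    out[row] = acc;
}

// Fallback for trip counts without a dedicated instantiation.
__global__ void rowSumGeneric(const float* in, float* out, int n, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += in[row * n + i];
    out[row] = acc;
}

// Run-time dispatch: choose a pre-compiled kernel based on n.
void launchRowSum(const float* dIn, float* dOut, int n, int rows)
{
    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    switch (n) {
    case 4:  rowSumFixed<4><<<blocks, threads>>>(dIn, dOut, rows);  break;
    case 8:  rowSumFixed<8><<<blocks, threads>>>(dIn, dOut, rows);  break;
    case 16: rowSumFixed<16><<<blocks, threads>>>(dIn, dOut, rows); break;
    default: rowSumGeneric<<<blocks, threads>>>(dIn, dOut, n, rows); break;
    }
}

// Host helper for a quick check: input is all 1.0f, so row 0 sums to n.
// Error checking is omitted to keep the sketch short.
float rowSumDispatchOnes(int n, int rows)
{
    std::vector<float> hIn(static_cast<size_t>(n) * rows, 1.0f);
    std::vector<float> hOut(rows, 0.0f);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    launchRowSum(dIn, dOut, n, rows);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut[0];
}
```

The trade-off is a larger binary and longer compile time, since every listed value of n gets its own kernel.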