Unrolling of loops with strides _not_ equal to 1

I would like to know whether the NVCC compiler unrolls loops with non-unit strides, either automatically or when forced to.
Something like:
template <int N, int STRIDE>
for (int i = 0; i < N; i += STRIDE)
Both the array size N and the stride STRIDE (which is not equal to 1) are given as template parameters.

The behavior of the compiler in this case is unspecified. It may partially or fully unroll the loop, or it may not unroll it at all. The behavior may depend on other factors (such as what is in the body of the loop, and the exact values of N and STRIDE), and it may change from version to version. Templating doesn’t really matter: the template parameters are known at compile time, so the compiler's decisions should be identical to the non-templated case with those specific values used as the loop bounds.

In general, the compiler's ability to unroll a loop is greatly enhanced if it can determine the trip count of the loop at compile time. Your proposed example meets this criterion, so I’m reasonably sure that the compiler will unroll such a loop in at least some cases.
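To make the trip-count point concrete, here is a minimal sketch (not taken from any real code base). The names HD, trip_count and strided_sum are made up for illustration, and the loop body is a placeholder. With N and STRIDE as template parameters, the number of iterations is a constant the compiler can compute, which is what makes full unrolling possible in principle. Whether it actually happens is still up to the compiler.

```cpp
// Assumption: HD is a hypothetical helper macro so the same code builds
// with nvcc (host + device) and with a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Number of iterations of: for (int i = 0; i < N; i += STRIDE)
// This is ceil(N / STRIDE) for N > 0, evaluated at compile time.
template <int N, int STRIDE>
HD constexpr int trip_count() {
    static_assert(STRIDE > 0, "STRIDE must be positive");
    return N > 0 ? (N + STRIDE - 1) / STRIDE : 0;
}

// Hypothetical loop body: sum every STRIDE-th element of a[0..N).
template <int N, int STRIDE>
HD float strided_sum(const float* a) {
    float sum = 0.0f;
    for (int i = 0; i < N; i += STRIDE) {  // bounds are compile-time constants
        sum += a[i];
    }
    return sum;
}
```

For example, trip_count<10, 3>() evaluates to 4, matching the iterations i = 0, 3, 6, 9.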

You might want to look at the #pragma unroll directive in the programming guide, and/or check what your particular compiler version does with your particular code by dumping the SASS (for example with cuobjdump -sass).
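A rough sketch of requesting unrolling for a strided loop follows. The macro names HD and UNROLL and the function scale_strided are hypothetical. The macro wrapper is only there so the snippet also compiles with a plain host compiler; in a .cu file you would normally just write #pragma unroll directly above the loop.

```cpp
// Assumption: HD and UNROLL are illustrative macros, not part of CUDA.
#ifdef __CUDACC__
#define HD __host__ __device__
#define UNROLL _Pragma("unroll")   // same effect as writing: #pragma unroll
#else
#define HD
#define UNROLL                     // no-op for non-CUDA compilers
#endif

// Hypothetical loop body: scale every STRIDE-th element of a[0..N) in place.
template <int N, int STRIDE>
HD void scale_strided(float* a, float factor) {
    UNROLL
    for (int i = 0; i < N; i += STRIDE) {
        a[i] *= factor;
    }
}
```

To see what the compiler actually did, one option is to build a cubin with nvcc and disassemble it with cuobjdump -sass, then look for a backward branch in the function. If the loop was fully unrolled there should be none, just a straight sequence of loads, multiplies and stores.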

Here are some caveats that used to apply the last time I made detailed observations about loop unrolling with the CUDA compiler. This was several years ago, so some or all of these caveats may have since disappeared, but you may want to keep them in mind for a conservative approach.

One thing to watch out for in the body of loops that one would like to have unrolled is control flow. In general you would want to stick to single-entry, single-exit constructs. This means avoiding ‘goto’ and its syntactic-sugar cousins such as ‘break’ and ‘continue’ in the loop itself, as well as multiple ‘return’ statements inside an inlined function called from the loop body. From a performance perspective, it is also best to avoid 64-bit integer types for loop counters, as they may cause additional overhead in partially unrolled loops.
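As an illustration of the control-flow caveat, here is a hedged sketch of rewriting a loop with an early ‘break’ into a single-exit form. Both functions and the HD macro are invented for this example. The second version always runs the full trip count and uses a flag to mask out work after the stopping condition, so it trades extra iterations for simpler control flow. It also reads every strided element, so all of them must be valid to access. Whether this is a net win depends on the code and the compiler version.

```cpp
// Assumption: HD is a hypothetical helper macro for host/device builds.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Early-exit version: the 'break' gives the loop a second exit.
// Counts strided elements until the first non-positive one.
template <int N, int STRIDE>
HD int count_leading_positive_break(const float* a) {
    int count = 0;
    for (int i = 0; i < N; i += STRIDE) {
        if (a[i] <= 0.0f) break;
        ++count;
    }
    return count;
}

// Single-exit version: same result, but every iteration executes and a
// flag suppresses counting after the first non-positive element.
template <int N, int STRIDE>
HD int count_leading_positive_flag(const float* a) {
    int count = 0;
    bool active = true;
    for (int i = 0; i < N; i += STRIDE) {  // 32-bit 'int' loop counter
        active = active && (a[i] > 0.0f);
        count += active ? 1 : 0;
    }
    return count;
}
```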

As txbob points out (and independent of the CUDA context), loop unrolling, like any optimization, happens at the discretion of the compiler, based on multiple heuristics that take into account trip count, code size, etc. For tighter control you would want to look into #pragma unroll. I have encountered at least two instances where I had to instruct the compiler to stop unrolling (with #pragma unroll 1) for best performance. Both of these cases involved partial unrolling, which carries overhead.
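For completeness, a small sketch of suppressing unrolling with #pragma unroll 1. The names HD, NO_UNROLL and strided_sum_rolled are hypothetical. Here the upper bound n is a runtime value, which is the kind of situation where a compiler may choose partial unrolling and then needs extra code to handle leftover iterations. Whether turning unrolling off helps is something to measure for your own kernel.

```cpp
// Assumption: HD and NO_UNROLL are illustrative macros, not part of CUDA.
#ifdef __CUDACC__
#define HD __host__ __device__
#define NO_UNROLL _Pragma("unroll 1")  // same as writing: #pragma unroll 1
#else
#define HD
#define NO_UNROLL                      // no-op for non-CUDA compilers
#endif

// Hypothetical loop with a runtime trip count: the stride is a template
// parameter, but n is only known when the function is called.
template <int STRIDE>
HD float strided_sum_rolled(const float* a, int n) {
    float sum = 0.0f;
    NO_UNROLL
    for (int i = 0; i < n; i += STRIDE) {
        sum += a[i];
    }
    return sum;
}
```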