Unrolling of loops with strides _not_ equal to 1

HannesF99 · January 19, 2015, 8:44am

I would like to know whether the NVCC compiler unrolls loops with non-unit strides, either automatically or when forced to.
So something like
template <int N, int STRIDE>
void f() {  // enclosing function added for illustration
    for (int i = 0; i < N; i += STRIDE) {
        // <do_something>
    }
}
Both the array size N and the stride (which is not equal to 1) are template parameters, so they are known at compile time.

Robert_Crovella · January 19, 2015, 2:49pm

The behavior of the compiler in this case is unspecified. It may fully unroll the loop, partially unroll it, or not unroll it at all. The outcome may depend on other factors (such as the contents of the loop body and the exact values of N and STRIDE), and it may change from one compiler version to the next. Templating doesn’t really matter: template parameters are known at compile time, so the compiler's decisions should be the same as for a non-templated loop that uses those specific values as its bounds.

In general, the compiler's ability to unroll a loop is greatly enhanced if it can determine the trip count of the loop at compile time. Your proposed example seems to meet this criterion. I’m therefore reasonably sure that the compiler will unroll such a loop in at least some cases.
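
To make the "trip count known at compile time" point concrete, here is a minimal sketch. The helper name trip_count is hypothetical (it is not from the thread or from any CUDA header), and it assumes N >= 0 and STRIDE > 0. It only shows that the iteration count of the strided loop is a constant expression; it says nothing about what a particular nvcc version will decide to do with it.

```cpp
// Hypothetical helper (not a CUDA API): number of iterations executed by
//     for (int i = 0; i < N; i += STRIDE)
// which is ceil(N / STRIDE) for N >= 0 and STRIDE > 0.
template <int N, int STRIDE>
constexpr int trip_count() {
    static_assert(N >= 0 && STRIDE > 0, "sketch assumes N >= 0 and STRIDE > 0");
    return (N + STRIDE - 1) / STRIDE;
}

// Evaluated entirely by the compiler, so the loop bound is a known constant.
static_assert(trip_count<16, 4>() == 4, "i = 0, 4, 8, 12");
static_assert(trip_count<10, 3>() == 4, "i = 0, 3, 6, 9");
```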

You might want to look at the #pragma unroll directive in the programming guide, and/or dump the SASS to see what the compiler version you are using actually does with your code.
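
As a rough illustration of the directive, the sketch below places #pragma unroll directly in front of a strided loop whose bounds are template parameters. The function name strided_sum and the HD macro are made up for this example. The macro is only there so the same source builds with nvcc or with a plain host compiler.

```cpp
// Lets the same source build with nvcc or a plain host C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical example: sum every STRIDE-th element of a length-N array.
template <int N, int STRIDE>
HD float strided_sum(const float* data) {
    float acc = 0.0f;
    // Request unrolling; N and STRIDE are compile-time constants here.
    // Host compilers that do not recognize this pragma ignore it.
#pragma unroll
    for (int i = 0; i < N; i += STRIDE) {
        acc += data[i];
    }
    return acc;
}
```

Whether the request was honored can only be confirmed in the generated machine code, for instance by running cuobjdump -sass on the compiled binary and looking for the loop branch.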

njuffa · January 19, 2015, 3:12pm

Here are some caveats that used to apply the last time I made detailed observations about loop unrolling with the CUDA compiler. This was several years ago, so some or all of these caveats may have since disappeared, but you may want to keep them in mind for a conservative approach.

One thing to watch out for in the body of loops that one would like to have unrolled is control flow. In general you would want to stick to single-entry, single-exit constructs. This means avoiding ‘goto’ and any of its syntactically sugar-coated cousins, such as ‘break’ and ‘continue’ in the loop itself, or multiple ‘return’ statements inside an inlined function called from the loop body. For performance, it is also best to avoid 64-bit integer types for loop counters, as they may cause additional overhead in partially unrolled loops.
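
One way to follow the single-entry, single-exit advice is to replace an early 'break' with a flag, so the loop always runs its full, compile-time-known trip count. The sketch below is an assumption about how that could look, not code from the thread. The function name count_before_negative and the HD macro are hypothetical. Note the trade-off: the loop keeps reading elements after the flag goes false.

```cpp
// Lets the same source build with nvcc or a plain host C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical example: count the strided elements that come before the
// first negative one, without using 'break' inside the loop.
template <int N, int STRIDE>
HD int count_before_negative(const float* data) {
    int count = 0;
    bool active = true;  // takes the place of an early 'break'
    // 32-bit 'int' counter, per the advice to avoid 64-bit loop counters.
    for (int i = 0; i < N; i += STRIDE) {
        active = active && (data[i] >= 0.0f);
        count += active ? 1 : 0;
    }
    return count;
}
```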

As txbob points out (and independent of the CUDA context), loop unrolling, like any optimization, happens at the discretion of the compiler, based on multiple heuristics that take into account trip count, code size, etc. For tighter control, you would want to look into #pragma unroll. I have encountered at least two instances where I had to instruct the compiler to stop unrolling (with #pragma unroll 1) for best performance. Both of these cases involved partial unrolling, which carries overhead.
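
For completeness, this is roughly what suppressing unrolling looks like on a strided loop. The function name strided_scale and the HD macro are invented for this sketch. Whether suppressing unrolling actually helps is something to measure for each case.

```cpp
// Lets the same source build with nvcc or a plain host C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical example: scale every STRIDE-th element of a length-N array.
template <int N, int STRIDE>
HD void strided_scale(float* data, float factor) {
    // An unroll factor of 1 asks the CUDA compiler to leave the loop rolled.
    // Host compilers that do not recognize this pragma ignore it.
#pragma unroll 1
    for (int i = 0; i < N; i += STRIDE) {
        data[i] *= factor;
    }
}
```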
