Automatic loop unrolling

Michel_Iwaniec · May 18, 2007, 6:44am

I noticed that arrays which are local to a thread are placed in local memory instead of registers. This is more or less expected, as registers are rarely indexable. The solution is obvious: just unroll the loops, so that every array index becomes a compile-time constant.

The problem is that each iteration of my loop is quite complex, and for a fluid simulation in 2D I need 9 iterations. If I want to extend it to 3D, I will need 19 iterations. To keep my code maintainable, I would have to leave this optimization until the last stage, when I’m certain that everything else works perfectly. However, the best solution would be if the compiler could do the job for me.
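One way to keep a single copy of the loop body while still getting constant indices is to let templates do the expansion. The following is a minimal sketch in plain C++ so it can be compiled on the host. The names `Unroll`, `CellDensity` and `cell_density` are hypothetical, the body is a trivial stand-in for the real per-direction work, and whether a given nvcc release accepts this form in device code and then keeps the array in registers is an assumption to check against the generated code.

```cpp
#include <cassert>

// Hypothetical constant: 9 lattice directions for the 2D case (19 in 3D).
const int kDirections = 9;

// Unroll<I, N>::run(body) expands at compile time into
// body.step<I>(); body.step<I+1>(); ... body.step<N-1>();
template <int I, int N>
struct Unroll {
    template <typename Body>
    static inline void run(Body& body) {
        body.template step<I>();
        Unroll<I + 1, N>::run(body);
    }
};

// Terminating case: I == N, nothing left to expand.
template <int N>
struct Unroll<N, N> {
    template <typename Body>
    static inline void run(Body&) {}
};

// Hypothetical loop body. It is written once, like a normal loop body,
// but after expansion every index I into f[] is a compile-time constant.
struct CellDensity {
    const float* in;        // input values for one cell
    float f[kDirections];   // per-thread array
    float density;          // running sum

    template <int I>
    inline void step() {
        f[I] = in[I];
        density += f[I];
    }
};

// Sums the kDirections input values of one cell using the expanded loop.
float cell_density(const float* in) {
    assert(in != 0);
    CellDensity body;
    body.in = in;
    body.density = 0.0f;
    Unroll<0, kDirections>::run(body);
    return body.density;
}
```

Moving from 2D to 3D in this sketch only means changing `kDirections` from 9 to 19; the body itself is not duplicated by hand.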

So I’m wondering if there are any plans to add automatic loop unrolling to the compiler anytime soon, so that local arrays can be placed in registers?

GarryB · May 18, 2007, 10:14am

Health Warning: I have no relation with nVidia.

I downloaded the source of nvcc (which is based on Open64, and therefore is bound by an open source license) from ftp://download.nvidia.com/CUDAOpen64/nvop…beta-0.8.tar.gz

After unpacking it, I found a presentation called nvopencc-tutorial.ppt by Mike Murphy, dated 11/06.

On the final slide titled “Future Work” it says:

new hardware features via intrinsics
…
tune wopt to minimize register pressure
unrolling
using 16-bit instructions
supporting calls
…

There are no dates.

I’d love to know too.

Garry

osiris1 · May 19, 2007, 1:09am

Well done Garry! I was going to say to Michel that I found nvcc was already pretty aggressive at unrolling loops. It certainly did more than I was expecting (provided the compiler could calculate the loop count and it was constant). Perhaps your loop body is deemed to be too big; we need a command-line optimization flag. My other comment to Michel was that you probably don’t want to put anything into registers that you don’t have to, as you only have 10 per thread (for 100% occupancy) and the compiler leaks registers pretty badly at the moment. Hoping for big improvements in this area in the upcoming release.
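The figure of 10 registers can be reproduced with a little arithmetic. This is a rough sketch that assumes 8192 registers and at most 768 resident threads per multiprocessor (figures for first-generation hardware; they are assumptions here and should be checked against the programming guide for your device). It ignores per-block allocation granularity and shared memory limits, and the function names are hypothetical.

```cpp
#include <cassert>

// Assumed hardware limits (first-generation hardware); not taken from this thread.
const int kRegistersPerMultiprocessor = 8192;
const int kMaxThreadsPerMultiprocessor = 768;

// Largest per-thread register count that still allows `resident_threads`
// threads to be resident at once. Integer division rounds down.
int register_budget(int registers_per_sm, int resident_threads) {
    assert(resident_threads > 0);
    return registers_per_sm / resident_threads;
}

// Upper bound on occupancy (resident threads / maximum threads) when each
// thread uses `regs_per_thread` registers. Registers are the only limit
// considered in this sketch.
double occupancy_bound(int regs_per_thread) {
    assert(regs_per_thread > 0);
    int threads = kRegistersPerMultiprocessor / regs_per_thread;
    if (threads > kMaxThreadsPerMultiprocessor) {
        threads = kMaxThreadsPerMultiprocessor;
    }
    return static_cast<double>(threads) / kMaxThreadsPerMultiprocessor;
}
```

Under these assumptions, `register_budget(8192, 768)` gives 10, and a kernel using 16 registers per thread is limited to 512 resident threads, or two thirds of the maximum.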
Cheers, Eric

paulius · May 19, 2007, 4:45pm

What follows is unrelated to loop unrolling, but affects performance. Local memory is uncached and slow (comparable to global memory according to the programming guide). I’d suggest placing the array in shared memory if you can, even if no other threads need to access it (you’d have to either interleave or pad those shared memory arrays so that simultaneous accesses by a half-warp don’t cause bank conflicts).
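The effect of interleaving or padding can be checked with index arithmetic alone. The sketch below is host-side C++ and assumes 16 shared memory banks of 32-bit words, a half-warp of 16 threads, and one 32-bit element per array entry; all function names are hypothetical. It counts how many threads of one half-warp land in the same bank when every thread reads the same element of its own array (1 means conflict-free).

```cpp
#include <cassert>

const int kBanks = 16;     // assumed number of shared memory banks
const int kHalfWarp = 16;  // threads that access shared memory together

// Interleaved layout: element e of all threads is stored contiguously.
int interleaved_index(int element, int thread, int threads_per_block) {
    return element * threads_per_block + thread;
}

// Per-thread layout: each thread owns `stride` consecutive words.
// `stride` is the array length plus any padding.
int strided_index(int element, int thread, int stride) {
    return thread * stride + element;
}

// Worst number of threads in the half-warp that hit a single bank.
int worst_bank_load(const int word_index[kHalfWarp]) {
    int hits[kBanks] = {0};
    int worst = 0;
    for (int t = 0; t < kHalfWarp; ++t) {
        int bank = word_index[t] % kBanks;
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

int interleaved_conflicts(int element, int threads_per_block) {
    assert(threads_per_block >= kHalfWarp);
    int idx[kHalfWarp];
    for (int t = 0; t < kHalfWarp; ++t) {
        idx[t] = interleaved_index(element, t, threads_per_block);
    }
    return worst_bank_load(idx);
}

int strided_conflicts(int element, int stride) {
    assert(stride > 0);
    int idx[kHalfWarp];
    for (int t = 0; t < kHalfWarp; ++t) {
        idx[t] = strided_index(element, t, stride);
    }
    return worst_bank_load(idx);
}
```

Under these assumptions the interleaved layout is conflict-free, a per-thread stride of 8 words puts 8 threads on each of two banks, and padding that stride to 9 makes it conflict-free again because 9 and 16 share no common factor.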

Paulius
