Automatic loop unrolling

I noticed that arrays which are local to a thread will be put in local memory instead of registers. This is kind of expected, as registers are rarely indexable. And the solution is obvious: just unroll the loops.

The problem is that each iteration of my loop is quite complex, and for a fluid simulation in 2D I need 9 iterations. If I want to extend it to 3D, I will need 19 iterations. To keep my code maintainable, I would have to leave this optimization until the last stage, when I’m certain that everything else works perfectly. However, the best solution would be if the compiler could do the job for me.
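To make the idea concrete, here is a minimal sketch of unrolling without writing the body out nine times by hand. It is plain C++ and only illustrates the technique: `relax`, `Unroll`, `update_rolled` and `update_unrolled` are hypothetical names, the loop body is a stand-in for the real one, and whether a given nvcc release accepts templates like this in device code is an assumption I have not checked.

```cpp
// Hypothetical stand-in for the complex per-direction loop body.
inline float relax(float f, int dir) {
    return f * 0.5f + static_cast<float>(dir);
}

constexpr int kDirs = 9;  // 9 directions in 2D (19 in 3D), as in the post

// Rolled version: cell[i] is indexed by a runtime variable, which is what
// tends to push a thread-local array out of registers.
inline void update_rolled(float (&cell)[kDirs]) {
    for (int i = 0; i < kDirs; ++i) {
        cell[i] = relax(cell[i], i);
    }
}

// Unrolled at compile time: every instantiation uses a constant index I,
// so each element can in principle be held in its own register.
template <int I>
struct Unroll {
    static void apply(float (&cell)[kDirs]) {
        Unroll<I - 1>::apply(cell);   // handle elements 0 .. I-1 first
        cell[I] = relax(cell[I], I);  // then element I
    }
};

// Recursion terminator: nothing left to do.
template <>
struct Unroll<-1> {
    static void apply(float (&)[kDirs]) {}
};

inline void update_unrolled(float (&cell)[kDirs]) {
    Unroll<kDirs - 1>::apply(cell);
}
```

Both functions compute the same result; the body is written once, so the code stays maintainable. Later CUDA toolchains also accept a `#pragma unroll` hint in front of a loop, which may make the template machinery unnecessary.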

So I’m wondering if there are any plans to add automatic loop unrolling to the compiler anytime soon, so that local arrays can be placed in registers?

Health Warning: I have no relation with nVidia.

I downloaded the source of nvcc (which is based on Open64, and therefore is bound by an open source license) from ftp://download.nvidia.com/CUDAOpen64/nvop…beta-0.8.tar.gz

After unpacking the archive, I found a presentation called nvopencc-tutorial.ppt by Mike Murphy, dated 11/06.

On the final slide titled “Future Work” it says:

  • new hardware features via intrinsics

  • tune wopt to minimize register pressure

  • unrolling

  • using 16-bit instructions

  • supporting calls

There are no dates.

I’d love to know too.

Garry

Well done Garry! I was going to say to Michael that I found nvcc was already pretty aggressive at unrolling loops. It certainly did more than I was expecting (provided the compiler could calculate the loop count and it was constant). Perhaps your loop body is deemed to be too big; we need a command-line flag to control this. My other comment to Michael was that you probably don’t want to put anything into registers that you don’t have to, as you only have 10 per thread (for 100% occupancy) and the compiler leaks registers pretty badly at the moment. I'm hoping for big improvements in this area in the upcoming release.
Cheers, Eric
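The figure of 10 registers can be reproduced with a little arithmetic. The sketch below assumes G80-class limits of 8192 registers and 768 resident threads per multiprocessor; both constants are assumptions to check against the programming guide for your hardware, and the function names are made up for illustration. It also ignores the fact that threads are scheduled in whole blocks.

```cpp
#include <algorithm>

constexpr int kRegistersPerSM = 8192;  // assumption: G80-class multiprocessor
constexpr int kMaxThreadsPerSM = 768;  // assumption: G80-class multiprocessor

// Largest per-thread register count that still lets every thread slot be used.
constexpr int registers_at_full_occupancy() {
    return kRegistersPerSM / kMaxThreadsPerSM;  // integer division rounds down
}

// Upper bound on resident threads for a kernel using regs_per_thread registers.
inline int resident_threads(int regs_per_thread) {
    return std::min(kMaxThreadsPerSM, kRegistersPerSM / regs_per_thread);
}
```

Under these assumptions a kernel using 16 registers per thread can keep at most 512 threads resident, and one using 32 registers at most 256, so every array element forced into a register has a cost in occupancy.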

What follows is unrelated to loop unrolling, but affects performance. Local memory is uncached and slow (comparable to global memory according to the programming guide). I’d suggest placing the array in shared memory if you can, even if no other threads need to access it (you’d have to either interleave or pad those shared memory arrays so that simultaneous accesses by a half-warp wouldn’t cause bank conflicts).
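The two layouts can be compared with a small host-side sketch that only does the index arithmetic. It assumes 16 shared memory banks, one 32-bit word wide, and a half-warp of 16 threads that all access the same element number of their own array at the same time. All function names are hypothetical, and this is plain C++ rather than kernel code.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 16;     // assumption: 16 banks of 32-bit words
constexpr int kHalfWarp = 16;  // assumption: 16 threads access together

// Padded layout: each thread owns a contiguous row of row_stride words.
inline int row_index(int thread, int elem, int row_stride) {
    return thread * row_stride + elem;
}

// Interleaved layout: element e of all threads is stored side by side.
inline int interleaved_index(int thread, int elem, int threads_per_block) {
    return elem * threads_per_block + thread;
}

// Worst number of half-warp threads hitting one bank (1 means conflict-free).
inline int rows_conflict_degree(int elem, int row_stride) {
    std::array<int, kBanks> hits{};
    int worst = 0;
    for (int t = 0; t < kHalfWarp; ++t) {
        int bank = row_index(t, elem, row_stride) % kBanks;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}

inline int interleaved_conflict_degree(int elem, int threads_per_block) {
    std::array<int, kBanks> hits{};
    int worst = 0;
    for (int t = 0; t < kHalfWarp; ++t) {
        int bank = interleaved_index(t, elem, threads_per_block) % kBanks;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}
```

Under these assumptions, a row stride of 8 words gives an 8-way conflict, while padding the row to 9 words removes it, because an odd stride visits every bank once. The 9-element and 19-element arrays from the original question already have odd strides, so they would need no padding. The interleaved layout is conflict-free whenever the block size is a multiple of 16.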

Paulius