I’m currently managing to store a 187 by 66 matrix (single precision) on-chip. The goal is actually to do some heavy-duty work on a 240 by 66 matrix, everything on-chip. There is enough space in the register file combined with some shared memory.
I’m using a lot of unrolling to make sure nothing spills over into local memory. When I try to unroll further (187+), the compiler starts getting unhappy:
“Advisory: Loop was not unrolled, too much code expansion”
Has anyone experienced similar issues? Any workarounds?
What I’ve seen others do (especially before CUDA even had loop unrolling) was to write their code in a template engine and tweak the parameters controlling the unrolling of different loops. They then ran automatic tests over all the various sets of parameters to find the fastest kernel. You could use a similar technique to “manually” unroll what you need so you don’t have to trust the compiler.
OK, I figured out a workaround. If I double the number of threads being used, the unrolling depth can be halved, and (at least as far as I can tell now…) this doesn’t add any new inter-block communication problems. This allows me to keep a 240 by 66 matrix on-chip, and I just increased the occupancy.
An easy fix for my problem, but I guess in some apps this would lead to extra reduction steps.