How can simple loop unrolling sometimes raise the peak performance of a kernel by huge factors? Is it just the instruction overhead or some kind of out-of-order processing by G80?
What (if any) kind of dynamic branch prediction is employed in the G80 processors? I tried searching the manual (and this forum) and it seems the only branch prediction is done by the compiler. Is there any more information in the public domain?
I’ve seen significant improvement from loop unrolling, but not “huge factors”. If the unrolling results in fetch/store coalescing then a big performance improvement can result. I’ve done this a couple of times by hand, but I haven’t seen it happen automatically just by replicating the loop body, and I haven’t managed even a factor of 2 from this technique alone.
On a lesser scale loop unrolling could change control flow instructions into predicated instructions, which tend to be more efficient.
Branch prediction
Branch prediction only wins if all of the threads in a warp go in the same direction, so putting lots of resources into dynamic branch prediction is much less worthwhile for a GPU than for a CPU. I therefore suspect the G80 does not use dynamic branch prediction, whereas static prediction is essentially free and is applied the same way to all threads in a warp.
This is just speculation, I don’t have info from nVidia.
I would have guessed that the biggest benefit of loop unrolling is that it allows pipelining of multiple memory accesses within a single thread. I don’t think the architecture itself can do this dynamically, but if the compiler can work out that the memory accesses are independent, it can fetch a bunch of them into registers: unrolling a standard “sum[i] = a[i] + b[i]”-type loop by 4, for example, could initiate 8 reads in a pipelined fashion before trying to use any of the values, as opposed to 2 reads.
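A rough sketch of what I mean (the kernel name is made up and the unroll factor of 4 is purely illustrative): a grid-stride add loop unrolled by hand so that all eight loads are issued before any of the sums need their values.

    __global__ void vecAddUnrolled(const float *a, const float *b, float *sum, int n)
    {
        int stride = blockDim.x * gridDim.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (; i + 3 * stride < n; i += 4 * stride) {
            // eight independent reads are in flight before any value is used
            float a0 = a[i],              b0 = b[i];
            float a1 = a[i + stride],     b1 = b[i + stride];
            float a2 = a[i + 2 * stride], b2 = b[i + 2 * stride];
            float a3 = a[i + 3 * stride], b3 = b[i + 3 * stride];
            sum[i]              = a0 + b0;
            sum[i + stride]     = a1 + b1;
            sum[i + 2 * stride] = a2 + b2;
            sum[i + 3 * stride] = a3 + b3;
        }
        for (; i < n; i += stride)   // remainder when n is not a multiple of the chunk
            sum[i] = a[i] + b[i];
    }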
Branch and test overhead is also probably not entirely trivial, but I doubt that ‘huge factors’ could be involved there.
CUDA is basically a descendant of the shader languages of previous architectures, in which dynamic looping was always discouraged, so I think they do no branch prediction, or at least very little. Looping in shaders has classically been very slow; the G80 is a lot faster than previous cards, but still.
I haven’t seen anyone mention this yet, but it’s a simple reason why loop unrolling is faster. Consider this:
1. Math instruction
2. If not done, goto 1
Only step 1 is doing useful math for you. Even assuming that the processor is very smart and can do the branch in the same amount of time it can do a math operation, the processor is still only 50% utilized to do useful math. But if you unroll the loop like this:
1. Math instruction
2. Math instruction
3. Math instruction
4. Math instruction
5. If not done, goto 1
You have 80% processor utilization (4 instructions out of 5) doing useful math. There’s nothing magical about this, and the idea applies to almost every computer architecture. Add in all the other inefficiencies of looping that others have already mentioned, and you can see why you want to execute that loop branch as rarely as possible.
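In CUDA you don’t even have to replicate the body by hand; a minimal sketch (kernel name and the trip count of 8 are just illustrative) asking the compiler to do the unrolling:

    __global__ void scaleEightPerThread(float *data, float s, int n)
    {
        // each block handles a contiguous chunk of 8 * blockDim.x elements
        int base = blockIdx.x * blockDim.x * 8 + threadIdx.x;
        #pragma unroll
        for (int k = 0; k < 8; ++k) {      // fixed trip count: fully unrollable
            int idx = base + k * blockDim.x;
            if (idx < n)
                data[idx] *= s;
        }
    }

With a compile-time trip count like this, the per-iteration test-and-branch can disappear entirely.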
I experimented a bit on kernels of mine, and it appears that loop unrolling wins performance especially when there are accesses to device memory in the loop. Pure shared-memory loops gained very little.
Also, you don’t necessarily gain anything on the logic operations when unrolling loops. Let’s say your loop is a simple sum over an array.
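Something like this rough sketch (just illustrative, unrolled by a factor of 4):

    __device__ float sumUnrolled(const float *x, int n)
    {
        float acc = 0.0f;
        int i = 0;
        for (; i + 3 < n; i += 4)      // one bound check per 4 additions
            acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
        for (; i < n; ++i)             // tail when n is not a multiple of 4
            acc += x[i];
        return acc;
    }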
My point is that you still have to check whether you’re finished, unrolled or not; the unrolled version just checks once per chunk instead of once per element. Unless you can assume the size of your data is a multiple of the unroll factor, of course.
Just my 2c to add: while loop unrolling is generally good, if your kernel already suffers from high register usage (and thus low occupancy) it might not be the right decision, since unrolling the loop tends to increase the number of registers used. Unrolling one particular kernel of mine (shmem accesses in the loop, in light of wumpus’s observation) decreased performance by ~10%; that kernel was already at 20 registers before unrolling.
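If register pressure is the worry, one thing to try (with nvcc, at least) is a partial unroll factor rather than full unrolling; a rough sketch with an arbitrary factor of 4 and a made-up kernel name:

    __global__ void sumOfSquares(const float *in, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        float acc = 0.0f;
        #pragma unroll 4   // unroll by 4 instead of fully, hoping to cap register growth
        for (int i = tid; i < n; i += stride)
            acc += in[i] * in[i];
        out[tid] = acc;
    }

Whether that actually keeps the register count down depends on what the compiler does with it, of course.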