Why is loop unrolling so good?

Hello People,

I have two questions regarding CUDA:-

  1. How can simple loop unrolling sometimes raise the peak performance of a kernel by huge factors? Is it just the reduced loop overhead, or some kind of out-of-order processing by the G80?

  2. What (if any) kind of dynamic branch prediction is employed in the G80 processors? I tried searching the manual (and this forum) and it seems the only branch prediction is done by the compiler. Is there any more information in the public domain?

Thanks,
Anjul

  1. Loop unrolling

I’ve seen significant improvements from loop unrolling, but not “huge factors”. If the unrolling results in fetch/store coalescing, a big performance improvement can follow. I’ve done this a couple of times by hand, but I haven’t seen it happen automatically just by replicating the loop body, and I’ve not managed even a factor of 2 by this technique alone.
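One way this can happen is when unrolling lets a small per-thread loop collapse into a single wide access. Roughly (made-up kernels, just to show the shape of the transformation):

// Looped: thread t reads four floats one at a time. Across a half-warp the
// addresses for any given iteration are 16 bytes apart, so on G80 these
// 4-byte fetches do not coalesce into large transactions.
__global__ void scale_looped(float *a, const float *b, float s)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < 4; i++)
        a[4 * t + i] = s * b[4 * t + i];
}

// Unrolled by hand and merged into one float4 load/store: consecutive threads
// now touch consecutive 16-byte words, which the hardware can coalesce.
__global__ void scale_unrolled(float4 *a, const float4 *b, float s)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = b[t];
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    a[t] = v;
}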

On a smaller scale, loop unrolling can let the compiler turn control-flow instructions into predicated instructions, which tend to be more efficient.

  2. Branch prediction

Branch prediction only wins if all of the threads in a warp go in the same direction, so spending lots of resources on dynamic branch prediction is much less worthwhile for a GPU than for a CPU. So I suspect the G80 does not do dynamic branch prediction, whereas static prediction is essentially free and is done the same way for every thread in a warp.

This is just speculation; I don’t have any info from nVidia.

The biggest point of loop unrolling is to put variables in registers:

for (int i = 0; i < 3; i++) a[i] = 0;

would put a[i] in local memory, while

a[0] = a[1] = a[2] = 0;

won’t.

This is a big win if you’re going to use this variable a lot later.
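To make that concrete, here’s a rough sketch (made-up kernel) of how the constant-index version lets the compiler keep the array in registers for the rest of the kernel:

__global__ void accumulate3(const float *in, float *out, int n)
{
    float a[3];

    // Indexed inside a loop, the compiler may place a[] in (slow) local
    // memory, since registers cannot be indexed dynamically:
    //   for (int i = 0; i < 3; i++) a[i] = 0.0f;

    // With constant indices, a[0], a[1], a[2] can each live in a register:
    a[0] = a[1] = a[2] = 0.0f;

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += blockDim.x * gridDim.x) {
        a[0] += in[3 * i + 0];
        a[1] += in[3 * i + 1];
        a[2] += in[3 * i + 2];
    }

    out[3 * t + 0] = a[0];
    out[3 * t + 1] = a[1];
    out[3 * t + 2] = a[2];
}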

I would have guessed that the biggest benefit from loop unrolling would be to allow pipelining of multiple memory accesses within a single thread. I don’t think that the architecture itself can do this dynamically, but if the compiler can work out that the memory accesses are independent, it can fetch a bunch of them into registers (so unrolling a standard “sum[i] = a[i] + b[i]”-type loop could initiate 8 reads in a pipelined fashion before trying to use any of the values, as opposed to 2).
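Written out, the idea looks roughly like this (made-up kernel; it assumes n is a multiple of 4 * blockDim.x * gridDim.x so the accesses stay in bounds):

__global__ void vec_add_unrolled(float *sum, const float *a, const float *b, int n)
{
    int stride = blockDim.x * gridDim.x;
    int base   = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = base; i < n; i += 4 * stride) {
        // Eight independent loads are issued back to back...
        float a0 = a[i];              float b0 = b[i];
        float a1 = a[i + stride];     float b1 = b[i + stride];
        float a2 = a[i + 2 * stride]; float b2 = b[i + 2 * stride];
        float a3 = a[i + 3 * stride]; float b3 = b[i + 3 * stride];

        // ...before any of the values are consumed, so the memory latencies
        // overlap instead of being paid one after another.
        sum[i]              = a0 + b0;
        sum[i + stride]     = a1 + b1;
        sum[i + 2 * stride] = a2 + b2;
        sum[i + 3 * stride] = a3 + b3;
    }
}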

Branch and test overhead is also probably not entirely trivial, but I doubt that ‘huge factors’ could be involved there.

Geoff.

CUDA is basically a descendant of the shader languages of previous architectures, in which dynamic looping was always discouraged, so I think they do no branch prediction, or at least very little. Looping in shaders has classically been very slow; the G80 is a lot faster than previous cards, but still.

Why would loops in general be slow on a processor? The only reason I can think of is having large branch stalls.

I haven’t seen anyone mention this yet, but it’s a simple reason why loop unrolling is faster. Consider this:

  1. Math instruction

  2. If not done, goto 1

Only step 1 is doing useful math for you. Even assuming that the processor is very smart and can do the branch in the same amount of time it can do a math operation, the processor is still only 50% utilized to do useful math. But if you unroll the loop like this:

  1. Math instruction

  2. Math instruction

  3. Math instruction

  4. Math instruction

  5. If not done, goto 1

You have 80% processor utilization (4 out of 5) doing useful math. There’s nothing magical about this, and the idea applies to almost every computer architecture. Now add in all the other inefficiencies of looping that others have already mentioned, and you can see why you want to execute that loop branch as rarely as possible.
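The same thing in code, if it helps (illustrative only; c, d and ITERS are made up, and ITERS is assumed to be a multiple of 4):

__device__ float poly_rolled(float c, float d, int ITERS)
{
    float val = 1.0f;
    // rolled: every useful multiply-add pays for one compare-and-branch
    for (int i = 0; i < ITERS; i++)
        val = val * c + d;
    return val;
}

__device__ float poly_unrolled(float c, float d, int ITERS)
{
    float val = 1.0f;
    // unrolled by 4: one compare-and-branch now covers four multiply-adds
    for (int i = 0; i < ITERS; i += 4) {
        val = val * c + d;
        val = val * c + d;
        val = val * c + d;
        val = val * c + d;
    }
    return val;
}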

I experimented a bit on kernels of mine, and it appears that loop unrolling wins the most performance when there are accesses to device memory in the loop. Pure shared-memory loops showed very little gain.

Also, you don’t necessarily gain anything on the test-and-branch logic when unrolling loops. Let’s say your loop is:

for (int x = tid; x < 1000; x += THREADS)
{
    shared[x] = global[x];
}

you’d unroll it to

x = tid;
while (true)
{
    if (x < 1000)
    {
        shared[x] = global[x];
        x += THREADS;
    } else break;
    if (x < 1000)
    {
        shared[x] = global[x];
        x += THREADS;
    } else break;
    ...
}

My point is that you do have to check whether you’re finished after every copy of the loop body, unrolled or not, unless of course you can assume the size of your data is a multiple of the unroll step.
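For example, if you can assume the element count (call it N, made up here) is a multiple of 4 * THREADS, a 4x unroll needs no checks at all:

for (int x = tid; x < N; x += 4 * THREADS)
{
    shared[x]               = global[x];
    shared[x + THREADS]     = global[x + THREADS];
    shared[x + 2 * THREADS] = global[x + 2 * THREADS];
    shared[x + 3 * THREADS] = global[x + 3 * THREADS];
}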

Just my 2c to add: while loop unrolling is generally good, if your kernel already suffers from high register usage (and thus low occupancy) it might not be the right decision, since unrolling the loop will tend to increase the number of registers used. One particular kernel of mine that I unrolled (shared-memory accesses in the loop, in line with wumpus’s observation) got about 10% slower after unrolling. The kernel was at 20 registers before unrolling.