loop unrolling

Does loop unrolling via #pragma unroll have any performance effect?

I have NOT tried “#pragma unroll”. But I have manually unrolled loops and got good performance.

Like, without unrolling I got 6x performance.

With unrolling 8 times I got 7.5x performance.

Maybe if you unroll more, you get more performance.

HOW DOES UNROLLING HELP?

The BRANCH that happens in a FOR loop is a WASTE OF TIME. So the RATIO of USEFUL PROCESSING per BRANCH increases with unrolling. In my case, I had just one C statement in my FOR loop that performed computation.
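For illustration, a minimal sketch of what manual unrolling looks like (the names sum and data are made up for the example, not the actual kernel code):

// Plain loop: one useful statement per branch check.
for (int i = 0; i < 2048; i += 1)
    sum += data[i];

// Manually unrolled 8 times: eight useful statements per branch check
// (assumes the trip count is a multiple of 8).
for (int i = 0; i < 2048; i += 8) {
    sum += data[i];
    sum += data[i + 1];
    sum += data[i + 2];
    sum += data[i + 3];
    sum += data[i + 4];
    sum += data[i + 5];
    sum += data[i + 6];
    sum += data[i + 7];
}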

Just to throw the other side of the coin into the mix, unrolling will not always improve performance. It could possibly result in a much larger register usage and adversely decrease occupancy as a result. As with everything in CUDA, it is highly dependent on your exact algorithm and there is no substitute for experimentation.

Besides straightforward loop unrolling, another unrolling variant known as “loop unroll and jam” can be quite effective if you have multiple nested loops and enough registers still available. This is a key technique that I used to achieve massive performance gains in our Coulomb potential kernels. Amusingly, at the time I wrote the code I didn’t know what the technical term for it was, though I knew the technique from past experience optimizing many codes. This kind of unrolling is something that many compilers won’t do for you, so it’s useful to know how to do it for yourself.
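Just as an illustration, here is a rough sketch of the idea in generic C (the loops, f(), and the array names are made up, not the actual Coulomb kernel code): the outer loop is unrolled and the copies of the inner loop body are "jammed" together into a single inner loop.

/* Original nested loops. */
for (int j = 0; j < NJ; j++)
    for (int i = 0; i < NI; i++)
        out[j] += f(a[j], b[i]);

/* Unroll the outer loop by 2 and jam the bodies into one inner loop,
   so b[i] is loaded once and reused for both j and j+1
   (assumes NJ is even). */
for (int j = 0; j < NJ; j += 2)
    for (int i = 0; i < NI; i++) {
        out[j]     += f(a[j],     b[i]);
        out[j + 1] += f(a[j + 1], b[i]);
    }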

Here are some descriptions of the “loop unroll and jam” method:
http://docs.hp.com/en/B3909-90003/ch05s09.html
http://citeseer.ist.psu.edu/200762.html
http://portal.acm.org/citation.cfm?id=2668…=ACM&coll=GUIDE

Cheers,
John Stone

I’ve found that loops containing global memory accesses especially can benefit from unrolling. In loops with only shared memory operations I found hardly any difference.

Whatever performance I stated in my post above was w.r.t. shared memory. Maybe with global memory it increases manifold. cubb…

Hmm, I have a general C question (as I am not really a C programming hero, I often find myself running into C-related trouble instead of CUDA-related trouble :( )

I have a macro defined, and I have just added #pragma unroll to it to do some loop unrolling. But using #pragma within a macro is causing trouble. A search on the net makes me think it may simply never work like the code below.

#define MEAN_MIN(var) \
    sdata[tid] = g_f##var[tid] + g_f##var[tid+256]; \
    mdata[tid] = fminf(g_f##var[tid], g_f##var[tid+256]); \
    #pragma unroll 4 \
    for(index = tid + 512; index + 256 < 2048; index += 512) { \
        sdata[tid] += g_f##var[index] + g_f##var[index+256]; \
        mdata[tid] = fminf(mdata[tid], fminf(g_f##var[index], g_f##var[index+256])); \
    }

Does anybody know of a solution?

The problem here is that #pragma is a preprocessing directive itself - therefore whenever it is encountered by the preprocessor, it tries to apply it directly, instead of considering it part of a macro expansion. Macros expand everything, except other preprocessing commands, which can be a bit annoying at times (Just look at what people have to do to get a working compile-time assertion).

Anyway, in your case I would recommend just unrolling by hand, or defining your own iteration macro that applies another macro n times, since it seems the loop count you need is a compile-time constant.
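A minimal sketch of that iteration-macro idea (REPEAT_4, MEAN_MIN_STEP and the array g_fA are made-up names for illustration, assuming a fixed 4x unroll of one loop body):

/* Apply macro M four times with a fixed stride. */
#define REPEAT_4(M, start, stride) \
    M((start)) \
    M((start) + (stride)) \
    M((start) + 2*(stride)) \
    M((start) + 3*(stride))

/* One iteration of the original loop body as a macro
   (g_fA stands in for your g_f##var arrays). */
#define MEAN_MIN_STEP(i) \
    sdata[tid] += g_fA[(i)] + g_fA[(i) + 256]; \
    mdata[tid] = fminf(mdata[tid], fminf(g_fA[(i)], g_fA[(i) + 256]));

/* Expands to four fully unrolled iterations: no loop, no branch, no #pragma needed. */
REPEAT_4(MEAN_MIN_STEP, tid + 512, 512)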

Also, I’m not certain whether the pragma has to be right in front of the for-loop, but I’d suppose so; you could check the PTX file for the answer. I think it’s also possible that the compiler unrolls the loop automatically if it’s not very big. When in doubt, check the disassembly! :)
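If you want to check, something along these lines should work with nvcc’s -ptx option (the file name is just an example):

nvcc -ptx mykernel.cu -o mykernel.ptx

Then look at whether the loop body appears repeated in the PTX.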

To tell you the truth, I also suspect that it was unrolled by the compiler automatically. Anyway, I have solved my trouble in another way (I got rid of the ‘need’ for the macro) and encountered another ‘problem’ here that is far less problematic.

Consider this code

#define HASH #

#define TEST_IT() \
HASH include <stdio.h>

TEST_IT()

int main()
{
    int i;
    for (i = 0; i < 100; i++)
    {
        printf("Halo");
    }
    return 0;
}

If I use “gcc” to pre-process alone, GCC expands the macro correctly. But the problem is that the pre-processed output still has the “# include <stdio.h>” in it. The pre-processor would ideally have to run one more pass to expand this. Instead, this #include is passed directly to the compiling phase and the compiler spits out an error.

So, the ideal way would be to first pre-process alone, re-direct the output to another C file, and then compile that C file.

Like this:

gcc -E x.c > y.c

gcc y.c

That would do the trick! I am not aware of a neat compiler option that would instruct the pre-processor to do one more pass. Kernighan and Ritchie could answer that.

Well, to tell you the truth, I am already very happy with your example. If I can use my macros to generate ‘new’ C code, then I can always modify that code to include #pragma directives.

It is really apparent now that I have been programming MATLAB for too long, or maybe I should say I have not been programming C for long enough :D

Nice to know that it helps you!

Note that:

The example I have quoted is for normal C applications. I am NOT sure how you can do it for CUDA applications, or how you can make the CUDA compiler do this magic.

It depends on how and when pre-processing is done for CUDA apps.

It is possible that we can use the native compiler’s pre-processing engine to do the pre-processing first and then call the CUDA compiler. That would also make sense. I hope the pre-processing engine will NOT worry too much about __global__ and the other CUDA-specific C extensions. If it does, then we need to take help from the pre-processing engine of NVCC itself. At this point, I am not sure what options are available for this.
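For example, if nvcc’s -E (--preprocess) option behaves the way gcc’s does, a two-pass approach along these lines might work (an untested sketch; file names are made up):

nvcc -E x.cu > y.cu

nvcc y.cu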