loop unrolling

Does loop unrolling via #pragma unroll have any performance effect?

I have NOT tried “#pragma unroll”. But I have manually unrolled loops and got good performance.

Like, without unrolling I got 6x performance.

With unrolling 8 times I got 7.5x performance.

Maybe if you unroll more, you get more performance.

HOW DOES UNROLLING HELP?

The BRANCH that happens in a FOR loop is a WASTE OF TIME. So the RATIO of USEFUL PROCESSING per BRANCH increases with unrolling. In my case, I had just one C statement in my FOR loop that performed computation.
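For illustration, a minimal sketch of what manual unrolling looks like (the names sum and data are made up for the example, not the actual kernel code):

// Plain loop: one useful statement per branch check.
for (int i = 0; i < 2048; i += 1)
    sum += data[i];

// Manually unrolled 8 times: eight useful statements per branch check
// (assumes the trip count is a multiple of 8).
for (int i = 0; i < 2048; i += 8) {
    sum += data[i];
    sum += data[i + 1];
    sum += data[i + 2];
    sum += data[i + 3];
    sum += data[i + 4];
    sum += data[i + 5];
    sum += data[i + 6];
    sum += data[i + 7];
}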

Just to throw the other side of the coin into the mix, unrolling will not always improve performance. It could possibly result in a much larger register usage and adversely decrease occupancy as a result. As with everything in CUDA, it is highly dependent on your exact algorithm and there is no substitute for experimentation.

Besides straightforward loop unrolling, another unrolling variant known as “loop unroll and jam” can be quite effective if you have multiple nested loops and enough registers still available. This is a key technique that I used to achieve massive performance gains in our Coulomb potential kernels. Amusingly, at the time I wrote the code I didn’t know what the technical term for it was, though I knew the technique from past experience optimizing many codes. This kind of unrolling is something that many compilers won’t do for you, so it’s useful to know how to do it for yourself.
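Just as an illustration, here is a rough sketch of the idea in generic C (the loops, f(), and the array names are made up, not the actual Coulomb kernel code): the outer loop is unrolled and the copies of the inner loop body are "jammed" together into a single inner loop.

/* Original nested loops. */
for (int j = 0; j < NJ; j++)
    for (int i = 0; i < NI; i++)
        out[j] += f(a[j], b[i]);

/* Unroll the outer loop by 2 and jam the bodies into one inner loop,
   so b[i] is loaded once and reused for both j and j+1
   (assumes NJ is even). */
for (int j = 0; j < NJ; j += 2)
    for (int i = 0; i < NI; i++) {
        out[j]     += f(a[j],     b[i]);
        out[j + 1] += f(a[j + 1], b[i]);
    }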

Here are some descriptions of the “loop unroll and jam” method:
http://docs.hp.com/en/B3909-90003/ch05s09.html
http://citeseer.ist.psu.edu/200762.html
http://portal.acm.org/citation.cfm?id=2668…=ACM&coll=GUIDE

Cheers,
John Stone

I’ve found that loops containing global memory accesses especially can benefit from unrolling. In loops with only shared memory operations I found hardly any difference.

Whatever performance I stated in my post above was w.r.t. shared memory. Maybe with global memory it increases manifold. cubb…

Hmm, I have a general C question (as I am not really a C programming hero, I often find myself running into C-related trouble instead of CUDA-related trouble :( )

I have a macro defined, and I have just added #pragma unroll to it to do some loop unrolling. But using #pragma within a macro is causing trouble. A search on the net makes me think it may simply never work like the code below.

#define MEAN_MIN(var) \
    sdata[tid] = g_f##var[tid] + g_f##var[tid+256]; \
    mdata[tid] = fminf(g_f##var[tid], g_f##var[tid+256]); \
    #pragma unroll 4 \
    for(index = tid + 512; index + 256 < 2048; index += 512) { \
        sdata[tid] += g_f##var[index] + g_f##var[index+256]; \
        mdata[tid] = fminf(mdata[tid], fminf(g_f##var[index], g_f##var[index+256])); \
    }

Does anybody know of a solution?

The problem here is that #pragma is a preprocessing directive itself - therefore whenever it is encountered by the preprocessor, it tries to apply it directly, instead of considering it part of a macro expansion. Macros expand everything, except other preprocessing commands, which can be a bit annoying at times (Just look at what people have to do to get a working compile-time assertion).

Anyway, in your case I would recommend just unrolling by hand, or defining your own iteration macro that applies another macro n times, since it seems the loop count you need is a compile-time constant.
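A minimal sketch of that iteration-macro idea (REPEAT_4, MEAN_MIN_STEP and the array g_fA are made-up names for illustration, assuming a fixed 4x unroll of one loop body):

/* Apply macro M four times with a fixed stride. */
#define REPEAT_4(M, start, stride) \
    M((start)) \
    M((start) + (stride)) \
    M((start) + 2*(stride)) \
    M((start) + 3*(stride))

/* One iteration of the original loop body as a macro
   (g_fA stands in for your g_f##var arrays). */
#define MEAN_MIN_STEP(i) \
    sdata[tid] += g_fA[(i)] + g_fA[(i) + 256]; \
    mdata[tid] = fminf(mdata[tid], fminf(g_fA[(i)], g_fA[(i) + 256]));

/* Expands to four fully unrolled iterations: no loop, no branch, no #pragma needed. */
REPEAT_4(MEAN_MIN_STEP, tid + 512, 512)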

Also, I’m not certain whether the pragma has to be right in front of the for-loop, but I’d suppose so; you could check the PTX file for the answer. I think it’s also possible that the compiler unrolls the loop automatically if it’s not very big. When in doubt, check the disassembly! :)
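If you want to check, something along these lines should work with nvcc’s -ptx option (the file name is just an example):

nvcc -ptx mykernel.cu -o mykernel.ptx

Then look at whether the loop body appears repeated in the PTX.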

To tell you the truth, I also suspect that it was unrolled by the compiler automatically. Anyway, I have solved my trouble in another way (I got rid of the ‘need’ for the macro) and encountered another ‘problem’ here that is far less problematic.

Consider this code

#define HASH #

#define TEST_IT() \
HASH include <stdio.h>

TEST_IT()

int main()
{
    int i;
    for (i = 0; i < 100; i++)
    {
        printf("Halo");
    }
    return 0;
}

If I use “gcc” to pre-process alone, GCC expands the macro correctly. But the problem is that the pre-processed output still has the “# include <stdio.h>” in it. The pre-processor would ideally have to run one more pass to expand this. Instead, this #include is passed directly to the compiling phase and the compiler spits out an error.

So, the ideal way would be to first pre-process alone, re-direct the output to another C file, and then compile that C file.

Like this:

gcc -E x.c > y.c

gcc y.c

That would do the trick! I am not aware of a neat compiler option that would instruct the pre-processor to do one more pass. Kernighan and Ritchie could answer that.

Well, to tell you the truth, I am already very happy with your example. If I can use my macros to generate ‘new’ C code, then I can always modify that code to include #pragma directives.

It is really apparent now that I have been programming MATLAB for too long, or maybe I should say I have not been programming C for long enough :D

Nice to know that it helps you!

Note that:

The example I have quoted is for normal C applications. I am NOT sure how you can do it for CUDA applications, or how you can make the CUDA compiler do this magic.

It depends on how and when pre-processing is done for CUDA apps.

It is possible that we can use the native compiler’s pre-processing engine to do the pre-processing first and then call the CUDA compiler. That would also make sense. I hope the pre-processing engine will NOT worry too much about __global__ and the other CUDA-specific C extensions. If it does, then we need to take help from the pre-processing engine of NVCC itself. At this point, I am not sure what options are available for this.
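For example, if nvcc’s -E (--preprocess) option behaves the way gcc’s does, a two-pass approach along these lines might work (an untested sketch; file names are made up):

nvcc -E x.cu > y.cu

nvcc y.cu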