#Pragma unroll doesn't work?

I’ve been using #pragma unroll since the CUDA 2.0 beta. However, for some reason, with the production CUDA 2.0 release on Linux

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2007 NVIDIA Corporation
Built on Thu_Jun_19_04:48:21_PDT_2008
Cuda compilation tools, release 2.0, V0.2.1221

this #pragma has ceased to work completely, even with the simplest example. Is there any additional flag to nvcc that should be added to allow this pragma to take effect?

Thanks

Mark

You’re verifying this using decuda?

No, since I’m generating code for GT200 and decuda doesn’t work with it. But anyway, the reference is very handy, thanks!

I’m checking it with a very simple sanity check: unrolling an incorrect number of iterations, which is supposed to produce incorrect results. But using #pragma doesn’t change a thing.

Well, actually there are two things:

  1. decuda does work with GT200-generated code, once you delete the new section it doesn’t understand, called “constrelocs”, from the cubin file

  2. Strangely enough, the compiler does unroll a simple loop, but it does not unroll a slightly more complex one. This time it was confirmed by decuda.

Here’s an example of code where the unrolling works:

#pragma unroll 3
for (i = 0; i < k; i++) {
    shmem_cache[i] = i;
}

Here’s an example (though a slightly more complex one) of code that is not unrolled:

for (j = 0; j < U; j++) {
#pragma unroll 2
    for (k = 0; k < numMatrices; k++) {
        uint mtxOffset;
        if (cache_lookup[k] == 0) {
            mtxOffset = func_call1(k) + func_call2(k);
            mult1 *= *(basePtrCache[k] + mtxOffset);
        } else {
            Datatype* cachePtr = cache + func_call3(k) + j;
            mult1 *= *(cachePtr);
        }
    }
}

The unrolling here doesn’t work at all: the code generated with and without the pragma is identical at the PTX level.

silbmarks, “numMatrices” is a #define or a const int, right? Have you tried putting in a literal?

It’s actually a template parameter of the function, i.e. the kernel is defined as

template <int numMatrices>
__global__ void foo(xxx) {
    ...
}

But even putting a literal constant there, e.g. 2 or 3, doesn’t help either.
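To illustrate why this should work, here’s a host-side C++ sketch (the names `product` and `vals` are made up, not from the thread) of the relevant property: a non-type template parameter is a compile-time constant inside the function body, so the trip count is fully known at compile time — exactly the situation #pragma unroll is supposed to handle.

```cpp
#include <cassert>

// Host-side sketch (assumed shape, hypothetical names): the trip count
// arrives as a non-type template parameter, so inside the body it is a
// constant expression, just like a literal.
template <int numMatrices>
int product(const int* vals) {
    int mult = 1;
    // In the real kernel this loop would carry "#pragma unroll 2"; here we
    // only demonstrate that numMatrices is known at compile time.
    for (int k = 0; k < numMatrices; ++k)
        mult *= vals[k];
    return mult;
}
```

Since each instantiation (product<2>, product<3>, …) is a separate function with a fixed trip count, the compiler has everything it needs to unroll — which makes nvcc’s refusal here all the stranger.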

I guess the compiler thinks it’s smarter than us now.

Or the NVIDIA engineers do. Did they forget that unrolling in CUDA is not the same half-useful thing it is on a CPU? Unrolling is critical for converting local memory into registers, and can’t be ignored just because a loop is big.

Most of the documented compiler attributes and pragmas do not function correctly. Your best bet is to keep the intermediate PTX (nvcc --keep) and DIY, or just macro the statements in your loop and duplicate the code that way. For example:

  1. “#pragma unroll” unrolls loops it’s not supposed to, and ignores some loops it is supposed to unroll.

  2. The compiler allocates registers that are used for nothing other than loop counters, e.g.: j = 0; for (i = 0; i < 32; i++) { j += foo[j]; }

  3. Alignment attributes are ignored in most cases when the compiler decides to emit loads and stores. :-/

  4. The compiler generates bank conflicts when referencing vector types.

The best thing to do to tweak performance, until nvcc becomes more mature, is to always check the PTX output.
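The macro trick mentioned above can look roughly like this (a sketch; UNROLL_BODY and sum4 are hypothetical names, not from any real codebase):

```cpp
#include <cassert>

// Manual unrolling by macro: put the loop body in a macro and paste it once
// per iteration. No runtime loop survives, so the compiler has no loop
// counter to keep in a register.
#define UNROLL_BODY(i) acc += data[(i)]

int sum4(const int* data) {
    int acc = 0;
    UNROLL_BODY(0);   // iteration 0, written out by hand
    UNROLL_BODY(1);
    UNROLL_BODY(2);
    UNROLL_BODY(3);
    return acc;
}
```

The obvious downside is that the iteration count is baked into the call site, so changing it means editing every unrolled sequence by hand.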

Again, I’ve submitted reproduction samples of this behavior, no response as of yet.

Yup. I wonder… is it possible to make some good macros/templates that will do the unrolling manually? I’ve tried this before, but couldn’t manage a completely general version. Perhaps there’s a third-party preprocessor that will do the trick?
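One common trick along those lines (sketched here with made-up names; C++03-compatible, so it should work with nvcc of this era) is compile-time unrolling via template recursion:

```cpp
#include <cassert>

// Sketch of a compile-time unroller (assumed approach, hypothetical names).
// Unroller<N>::step(f) calls f(0) .. f(N-1); the recursion is flattened at
// compile time, so no runtime loop counter remains -- the same effect
// #pragma unroll is meant to achieve.
template <int N>
struct Unroller {
    template <typename F>
    static void step(F& f) {
        Unroller<N - 1>::step(f);  // unroll iterations 0 .. N-2 first
        f(N - 1);                  // then iteration N-1
    }
};

template <>
struct Unroller<0> {
    template <typename F>
    static void step(F&) {}       // base case: nothing left to unroll
};

// Example functor: accumulate the iteration index into a sum.
struct Sum {
    int total;
    Sum() : total(0) {}
    void operator()(int i) { total += i; }
};
```

Here Unroller<4>::step(s) expands into four inlined calls s(0) … s(3). In device code the functor’s operator() would be a __device__ function; whether this actually beats the compiler’s own unrolling would need to be confirmed by checking the PTX, as suggested above.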