#Pragma unroll doesn't work?

I’ve been using #pragma unroll since the CUDA 2.0 beta. However, for some reason, with the production CUDA 2.0 release on Linux

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2007 NVIDIA Corporation
Built on Thu_Jun_19_04:48:21_PDT_2008
Cuda compilation tools, release 2.0, V0.2.1221

this #pragma has ceased to work completely, even with the simplest example. Is there any additional flag to nvcc that should be added to allow this pragma to take effect?

Thanks

Mark

You’re verifying this using decuda?

No, since I’m generating code for GT200 and decuda doesn’t work with it. But anyway, the reference is very handy, thanks!

I’m checking it with a very simple sanity check: unrolling an incorrect number of iterations, which is supposed to produce incorrect results. But using #pragma doesn’t change a thing.

Well, actually there are two things:

  1. decuda does work with GT200-generated code, once you delete the new section it doesn’t understand, called “constrelocs”, from the cubin file

  2. Strangely enough, the compiler does unroll a simple loop, but it does not unroll a slightly more complex one. This time it was confirmed by decuda.

Here’s an example of code where the unrolling works:

#pragma unroll 3
for (i = 0; i < k; i++) {
    shmem_cache[i] = i;
}

Here’s an example (though a slightly more complex one) of code that is not unrolled:

for (j = 0; j < U; j++) {
#pragma unroll 2
    for (k = 0; k < numMatrices; k++) {
        uint mtxOffset;
        if (cache_lookup[k] == 0) {
            mtxOffset = func_call1(k) + func_call2(k);
            mult1 *= *(basePtrCache[k] + mtxOffset);
        } else {
            Datatype* cachePtr = cache + func_call3(k) + j;
            mult1 *= *(cachePtr);
        }
    }
}

The unrolling here doesn’t work at all: the code generated with and without the pragma is identical at the PTX level.

silbmarks, “numMatrices” is a #define or a const int, right? Have you tried putting in a literal?

It’s actually a template parameter of the function, i.e. the kernel is defined as

template <int numMatrices>
__global__ void foo(xxx) {
    ...
}

But even putting a literal constant there, e.g. 2 or 3, doesn’t help either.
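To illustrate why this should work, here’s a host-side C++ sketch (the names `product` and `vals` are made up, not from the thread) of the relevant property: a non-type template parameter is a compile-time constant inside the function body, so the trip count is fully known at compile time — exactly the situation #pragma unroll is supposed to handle.

```cpp
#include <cassert>

// Host-side sketch (assumed shape, hypothetical names): the trip count
// arrives as a non-type template parameter, so inside the body it is a
// constant expression, just like a literal.
template <int numMatrices>
int product(const int* vals) {
    int mult = 1;
    // In the real kernel this loop would carry "#pragma unroll 2"; here we
    // only demonstrate that numMatrices is known at compile time.
    for (int k = 0; k < numMatrices; ++k)
        mult *= vals[k];
    return mult;
}
```

Since each instantiation (product<2>, product<3>, …) is a separate function with a fixed trip count, the compiler has everything it needs to unroll — which makes nvcc’s refusal here all the stranger.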

I guess the compiler thinks it’s smarter than us now.

Or the NVIDIA engineers do. Did they forget that unrolling in CUDA is not the same half-useful thing it is on a CPU? Unrolling is critical for converting local memory into registers, and can’t be ignored just because a loop is big.

Most of the documented compiler attributes and pragmas do not function correctly. Your best bet is to keep the intermediate PTX (nvcc --keep) and DIY, or just macro the statements in your loop and duplicate the code that way. For example:

  1. “#pragma unroll” unrolls loops it’s not supposed to, and ignores some loops it is supposed to unroll.

  2. The compiler allocates registers that are used for nothing other than loop counters, e.g.: j = 0; for (i = 0; i < 32; i++) { j += foo[j]; }

  3. Alignment attributes are ignored in most cases when the compiler decides to emit loads and stores. :-/

  4. The compiler generates bank conflicts when referencing vector types.

The best thing to do to tweak performance, until nvcc becomes more mature, is to always check the PTX output.
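The macro trick mentioned above can look roughly like this (a sketch; UNROLL_BODY and sum4 are hypothetical names, not from any real codebase):

```cpp
#include <cassert>

// Manual unrolling by macro: put the loop body in a macro and paste it once
// per iteration. No runtime loop survives, so the compiler has no loop
// counter to keep in a register.
#define UNROLL_BODY(i) acc += data[(i)]

int sum4(const int* data) {
    int acc = 0;
    UNROLL_BODY(0);   // iteration 0, written out by hand
    UNROLL_BODY(1);
    UNROLL_BODY(2);
    UNROLL_BODY(3);
    return acc;
}
```

The obvious downside is that the iteration count is baked into the call site, so changing it means editing every unrolled sequence by hand.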

Again, I’ve submitted reproduction samples of this behavior, no response as of yet.

Yup. I wonder… is it possible to make some good macros/templates that will do the unrolling manually? I’ve tried this before, but couldn’t manage a completely general version. Perhaps there’s a third-party preprocessor that will do the trick?
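One common trick along those lines (sketched here with made-up names; C++03-compatible, so it should work with nvcc of this era) is compile-time unrolling via template recursion:

```cpp
#include <cassert>

// Sketch of a compile-time unroller (assumed approach, hypothetical names).
// Unroller<N>::step(f) calls f(0) .. f(N-1); the recursion is flattened at
// compile time, so no runtime loop counter remains -- the same effect
// #pragma unroll is meant to achieve.
template <int N>
struct Unroller {
    template <typename F>
    static void step(F& f) {
        Unroller<N - 1>::step(f);  // unroll iterations 0 .. N-2 first
        f(N - 1);                  // then iteration N-1
    }
};

template <>
struct Unroller<0> {
    template <typename F>
    static void step(F&) {}       // base case: nothing left to unroll
};

// Example functor: accumulate the iteration index into a sum.
struct Sum {
    int total;
    Sum() : total(0) {}
    void operator()(int i) { total += i; }
};
```

Here Unroller<4>::step(s) expands into four inlined calls s(0) … s(3). In device code the functor’s operator() would be a __device__ function; whether this actually beats the compiler’s own unrolling would need to be confirmed by checking the PTX, as suggested above.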