This #pragma has ceased to work completely, even with the simplest example. Is there an additional nvcc flag that has to be added for this pragma to have any effect?
No, since I’m generating the code for GT200 and decuda doesn’t work with it. But anyway, the reference is very handy, thanks!
I’m checking it with a very simple sanity check: unrolling the loop an incorrect number of times, which is supposed to produce incorrect results. But using the #pragma doesn’t change a thing.
I guess the compiler thinks it’s smarter than us now.
Or the NVIDIA engineers do. Did they forget that CUDA unroll is not the same half-useful thing as on a CPU? Unrolling is critical to convert local memory into registers, and can’t be ignored just because the loop’s big.
Most of the documented compiler attributes and pragmas do not function correctly. Your best bet is to keep the intermediate PTX (nvcc’s --keep or --ptx) and DIY, or just macro the statements in your loop and duplicate the code that way. “#pragma unroll” unrolls loops it’s not supposed to and ignores some loops it is supposed to unroll. It also allocates registers that are used for nothing other than loop counters, e.g. j = 0; for(i = 0; i < 32; i++) { j += foo[j]; }. Alignment attributes are ignored in most cases when the compiler decides to emit loads and stores :-/. The compiler generates bank conflicts when referencing vector types. The best thing to do to tweak performance until nvcc becomes more mature is to always check the PTX output.
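For what it’s worth, here is a minimal sketch of the “macro the statements and duplicate the code” approach, applied to the pointer-chasing loop above. The `REPEAT_*` macros, `HOST_DEVICE`, and `chase32` are names made up for this example, not anything nvcc or CUDA provides. It assumes `foo` holds at least as many elements as the largest index the chain reaches.

```cpp
// Hypothetical helper macros (not part of CUDA): paste a statement a fixed
// number of times, so the source has no loop and no loop counter at all.
#define REPEAT_1(stmt)  stmt
#define REPEAT_2(stmt)  REPEAT_1(stmt) REPEAT_1(stmt)
#define REPEAT_4(stmt)  REPEAT_2(stmt) REPEAT_2(stmt)
#define REPEAT_8(stmt)  REPEAT_4(stmt) REPEAT_4(stmt)
#define REPEAT_16(stmt) REPEAT_8(stmt) REPEAT_8(stmt)
#define REPEAT_32(stmt) REPEAT_16(stmt) REPEAT_16(stmt)

// Lets the same code compile with nvcc (device + host) or a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Same work as: j = 0; for (i = 0; i < 32; i++) { j += foo[j]; }
HOST_DEVICE inline int chase32(const int *foo)
{
    int j = 0;
    REPEAT_32(j += foo[j];)   // expands to 32 copies of the statement
    return j;
}
```

The obvious limitation is that the repeated statement cannot refer to an iteration index, and the trip count has to be one of the sizes you wrote a macro for. It is still worth checking the PTX to see what the compiler actually did with it.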
Again, I’ve submitted reproduction samples of this behavior, but no response as of yet.
Yup. I wonder… is it possible to make some good macros/templates that will manually do the unrolling? I’ve tried this before, but couldn’t succeed in making a completely general version. Perhaps there’s a third-party preprocessor that will do the trick?
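One template-based shape this could take, as a rough sketch rather than a completely general solution: a recursive struct that calls a functor once per index, with the recursion ended by a partial specialization. `Unroller`, `Accumulate`, `HOST_DEVICE`, and `sum8` are hypothetical names for illustration. The index is passed as an ordinary `int`, so this relies on the compiler inlining the calls and folding the constant; nothing here guarantees that.

```cpp
// Lets the same code compile with nvcc (device + host) or a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: Unroller<I, N>::run(f) calls f(I), f(I+1), ..., f(N-1).
// Each call is a separate template instantiation, so the source has no loop.
template <int I, int N>
struct Unroller {
    template <typename F>
    HOST_DEVICE static void run(F &f) {
        f(I);
        Unroller<I + 1, N>::run(f);
    }
};

// Recursion terminator: nothing left to do once I == N.
template <int N>
struct Unroller<N, N> {
    template <typename F>
    HOST_DEVICE static void run(F &) {}
};

// Example loop body written as a functor; it receives the iteration index.
struct Accumulate {
    const float *src;
    float sum;
    HOST_DEVICE void operator()(int i) { sum += src[i]; }
};

// Equivalent to: for (i = 0; i < 8; i++) sum += src[i];
HOST_DEVICE inline float sum8(const float *src)
{
    Accumulate body = { src, 0.0f };
    Unroller<0, 8>::run(body);
    return body.sum;
}
```

The trip count has to be a compile-time constant and every loop body has to be wrapped in a functor, which is probably where the “completely general” part falls down.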