NVCC efficiency: just do it... yourself

Hi,

My kernel had some loops with a constant number of iterations. In such cases, all modern C/C++ compilers are able to unroll the loops to optimize the code. It seems NVCC is not, since by unrolling my loops by hand I obtain a speedup of about 15% :blarg:

I hope someone can give me just one reason not to say that NVCC sucks!

The CUDA 2.0 beta unrolls loops. You can even control the unrolling with #pragma unroll.
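For illustration, here is a minimal sketch of what that can look like. The kernel name `sumTaps`, the constant `TAPS`, and the launch configuration are all made up for this example and are not taken from the original poster's code; it only assumes a loop whose trip count is known at compile time.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TAPS 8  // hypothetical compile-time trip count

// Each thread sums TAPS consecutive input values.
__global__ void sumTaps(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Request full unrolling; the trip count is a compile-time constant.
    // "#pragma unroll 4" would request a factor of 4 instead, and
    // "#pragma unroll 1" would ask the compiler not to unroll at all.
    #pragma unroll
    for (int k = 0; k < TAPS; ++k)
        acc += in[i * TAPS + k];

    out[i] = acc;
}

int main()
{
    const int n = 256;
    const size_t inBytes  = n * TAPS * sizeof(float);
    const size_t outBytes = n * sizeof(float);

    // Host buffers; every input is 1.0f, so each output should equal TAPS.
    float *hIn  = (float *)malloc(inBytes);
    float *hOut = (float *)malloc(outBytes);
    for (int i = 0; i < n * TAPS; ++i) hIn[i] = 1.0f;

    float *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, inBytes);
    cudaMalloc((void **)&dOut, outBytes);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);

    sumTaps<<<(n + 127) / 128, 128>>>(dIn, dOut, n);

    // The blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return 0;
}
```

Whether the pragma actually helps depends on the kernel, so it is worth timing both variants. Compiling with `nvcc --ptxas-options=-v` prints the register count per kernel, which makes it easy to compare the rolled and unrolled versions.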

With regards to your other recent post, though: sometimes unrolling loops increases register usage :)

You are right, my first idea was to remove some register usage by unrolling my loops, and it works quite well. But, doing it step by step, I sometimes saw a gain without removing any register variables :blarg:

I use CUDA 2.0b, but in fact I had only read the Programming Guide for v1.0. Indeed, I’m now looking at the 2.0 guide, ch. 4, the section on #pragma unroll.

So it seems I have to read the latest guide :)

Thx for your time and your help, especially when my questions are not very relevant :angel: