Loop unrolling

I have a simple code and I want to prevent the compiler from unrolling the 2 loops, so I added #pragma unroll 1. I get a running time of 8 seconds. Then I unrolled the inner one with #pragma unroll. Again I get 8 seconds. If I unroll both, I get 8 seconds again. If I get rid of the pragmas, I get the same time again.

Can anyone tell me why I can't stop the compiler from unrolling? I am using CUDA 4.0 and compute capability 2.0. The card is a GeForce GTX 470.
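Since the original code was not posted, here is a minimal sketch of what the setup described above might look like. The kernel name nestedSum, the trip counts ROWS and COLS, and the arithmetic are all made up for illustration. The only point is where the pragmas go: each one sits immediately before the loop it applies to, and it has to be spelled exactly "#pragma unroll", since compilers generally ignore a pragma they do not recognize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 8    // hypothetical trip counts, chosen only for illustration
#define COLS 16

// Each thread sums ROWS * COLS values derived from its own index.
__global__ void nestedSum(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;

    #pragma unroll 1              // ask the compiler to keep the outer loop rolled
    for (int i = 0; i < ROWS; ++i) {
        #pragma unroll            // ask for full unrolling of the inner loop
        for (int j = 0; j < COLS; ++j) {
            acc += (float)(tid + i * COLS + j);
        }
    }
    out[tid] = acc;
}

int main()
{
    const int n = 256;
    float *d_out = 0;
    float h_out[n];

    cudaMalloc((void **)&d_out, n * sizeof(float));
    nestedSum<<<(n + 127) / 128, 128>>>(d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Thread 0 adds up 0 + 1 + ... + 127 = 8128.
    printf("out[0] = %.0f (expected 8128)\n", h_out[0]);
    return 0;
}
```

The result is the same whichever pragmas are used. They only change the generated instructions, so identical timings by themselves do not show whether the pragmas were honored.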


Is it possible that the loops are too large? Can you check the number of registers used with and without the unrolling?
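One way to check the register count, sketched here under assumptions: the two kernels rolled and unrolled are hypothetical stand-ins for the real code, identical except for the pragma. cudaFuncGetAttributes reports the registers per thread of each compiled kernel. Alternatively, compiling with "nvcc -Xptxas -v" prints the same figures at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, loop kept rolled.
__global__ void rolled(float *out, const float *in)
{
    float acc = 0.0f;
    #pragma unroll 1
    for (int k = 0; k < 32; ++k)
        acc += in[threadIdx.x + k] * (float)k;
    out[threadIdx.x] = acc;
}

// Same body, but full unrolling requested.
__global__ void unrolled(float *out, const float *in)
{
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 32; ++k)
        acc += in[threadIdx.x + k] * (float)k;
    out[threadIdx.x] = acc;
}

// Print the resource usage the runtime reports for one kernel.
static void report(const char *name, const cudaFuncAttributes &attr)
{
    printf("%-8s %d registers per thread, %lu bytes of local memory\n",
           name, attr.numRegs, (unsigned long)attr.localSizeBytes);
}

int main()
{
    cudaFuncAttributes attr;

    // The kernels are only queried here, never launched.
    if (cudaFuncGetAttributes(&attr, rolled) == cudaSuccess)
        report("rolled", attr);
    if (cudaFuncGetAttributes(&attr, unrolled) == cudaSuccess)
        report("unrolled", attr);
    return 0;
}
```

If both variants report the same numbers, that suggests (but does not prove) that the compiler generated the same code for both.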

They are not. I have tested the same code on my laptop and it works there. On my desktop it does not. But my laptop has compute capability 2.1 and CUDA 3.2. I don't know whether that makes any difference.

Have you confirmed the unrolling by looking at the object code, or are you only deducing this from the runtimes?
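A rough sketch of how the object code could be inspected, assuming a hypothetical file probe.cu with a kernel named probe. The macro KEEP_ROLLED is also invented here and simply switches between the two pragmas, so the same source can be compiled both ways and the disassembly compared. The commands in the comments use sm_20 to match the GTX 470. Substitute the architecture of your own card.

```cuda
// Build both variants as cubin files and dump the machine code (SASS):
//
//   nvcc -arch=sm_20 -cubin -DKEEP_ROLLED -o rolled.cubin   probe.cu
//   nvcc -arch=sm_20 -cubin               -o unrolled.cubin probe.cu
//   cuobjdump -sass rolled.cubin
//   cuobjdump -sass unrolled.cubin
//
// A rolled loop would normally show up as a short body ending in a
// backward branch; an unrolled one as the body repeated many times.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe(float *out, float scale)
{
    float acc = 0.0f;

#ifdef KEEP_ROLLED
    #pragma unroll 1      // request: do not unroll
#else
    #pragma unroll        // request: unroll completely
#endif
    for (int k = 0; k < 32; ++k)
        acc = acc * scale + (float)k;   // scale is a runtime value

    out[threadIdx.x] = acc;
}

int main()
{
    float *d_out = 0;
    float h_out[32];

    cudaMalloc((void **)&d_out, sizeof(h_out));
    probe<<<1, 32>>>(d_out, 1.0f);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // With scale = 1 the loop adds 0 + 1 + ... + 31 = 496.
    printf("out[0] = %.0f (expected 496)\n", h_out[0]);
    return 0;
}
```

If the two dumps look the same, the pragma is not taking effect. If they differ but the runtimes match, the loops are probably not where the time is going.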