Is it possible to automatically unroll nested for-loops with #pragma unroll? Right now I have to generate the code with a MATLAB script and then paste it into the .cu file.
Yes, you can unroll nested loops with #pragma unroll. I have done so for loop nests up to three levels deep, resulting in straight-line code. I am not sure what all the restrictions are that the compiler imposes, but I think you will have to start with the innermost loop and work your way up from there. You will need a separate #pragma unroll on the line directly preceding the start of each loop.
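As a minimal sketch of the pattern described above, here is a two-level loop nest with a separate #pragma unroll in front of each loop. The names small_matvec and matvec_kernel and the sizes ROWS and COLS are hypothetical, chosen only for illustration; whether the compiler actually unrolls the nest fully still has to be checked in the generated code.

```cuda
#define ROWS 3   // hypothetical compile-time sizes, so both trip counts
#define COLS 4   // are known to the compiler

// Computes y = A * x for one small ROWS x COLS row-major matrix.
// Each loop gets its own #pragma unroll on the line directly before it.
__host__ __device__ inline void small_matvec(const float *A,
                                             const float *x,
                                             float *y)
{
    #pragma unroll
    for (int r = 0; r < ROWS; r++) {
        float acc = 0.0f;
        #pragma unroll
        for (int c = 0; c < COLS; c++) {
            acc += A[r * COLS + c] * x[c];
        }
        y[r] = acc;
    }
}

// One thread per matrix; n is the number of matrices.
__global__ void matvec_kernel(const float *A, const float *x,
                              float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        small_matvec(A + i * ROWS * COLS, x + i * COLS, y + i * ROWS);
    }
}
```

Both loops here are simple counted loops with constant bounds, which is the case the reply says works best.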
I assume you are trying to achieve full, not partial, unrolling. Make sure your loops have a trip count that can be determined at compile time; simple counted for-loops work best. When I tried it once, the compiler was not able to recognize that a loop that bit-shifts a value until it reaches zero had a compile-time determinable trip count. From what I have seen, overall code size may inhibit unrolling, as may unstructured control flow such as break, continue, or return statements (which are just different ways of saying “goto”). Some of the inlined math functions used to contain such constructs (in particular multiple returns); I cleaned up most of these for CUDA 5.0 by changing the code to single-entry / single-exit constructs.
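To illustrate the shift-until-zero observation, the sketch below shows one possible loop of that kind next to a rewrite as a counted loop. This is an assumed example, not the loop from the original experiment: the function names are hypothetical, the shift direction is a guess, and both versions simply count the set bits in the low 8 bits of a value.

```cuda
// Shift-until-zero form: the loop always runs 8 times, but the trip
// count is implied by the mask reaching zero rather than by a counter.
__host__ __device__ inline int count_bits_shift(unsigned int v)
{
    int n = 0;
    for (unsigned int mask = 0x80u; mask != 0; mask >>= 1) {
        if (v & mask) {
            n++;
        }
    }
    return n;
}

// Counted form: an int counter with unit stride and a constant bound,
// with no break, continue, or early return in the body.
__host__ __device__ inline int count_bits_counted(unsigned int v)
{
    int n = 0;
    #pragma unroll
    for (int b = 0; b < 8; b++) {
        n += (v >> b) & 1u;
    }
    return n;
}

// Hypothetical kernel applying the counted version to an array.
__global__ void count_bits_kernel(const unsigned int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = count_bits_counted(in[i]);
    }
}
```

The two functions return the same result; the second just makes the trip count explicit in the form the reply recommends.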
If you have a relatively simple loop nest that the compiler refuses to unroll, you could post it here (or send me a PM through the forum) and I will follow up with the compiler team.
[Later:] By “counted for loop” I am referring to a for-loop controlled by a variable of type “int”, incremented with unit stride. This doesn’t mean other kinds of loops with a compile-time derivable trip count can’t be unrolled (e.g. with strided increment), but I am fairly certain that loops controlled by a floating-point variable cannot be unrolled at present. The compiler is constantly improving of course, so the only sure way to find out whether unrolling took place for a particular loop is to inspect the generated code with cuobjdump.
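A rough sketch of what that distinction looks like in practice: the first function below is controlled by a floating-point variable, the second computes the same sum with an int counter and derives the floating-point value inside the body. The function names and the STEPS constant are hypothetical, and whether either loop is actually unrolled depends on the compiler version.

```cuda
#define STEPS 4   // hypothetical: number of samples in [0, 1)

// Loop controlled by a float: by the account above, probably not unrolled.
__host__ __device__ inline float sum_squares_float_ctrl()
{
    float s = 0.0f;
    for (float x = 0.0f; x < 1.0f; x += 0.25f) {
        s += x * x;
    }
    return s;
}

// Same computation as a counted loop: int control variable, unit stride,
// constant bound; x is derived from the counter.
__host__ __device__ inline float sum_squares_int_ctrl()
{
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < STEPS; i++) {
        float x = 0.25f * (float)i;
        s += x * x;
    }
    return s;
}

// Hypothetical kernel so the counted version shows up in device code.
__global__ void sum_squares_kernel(float *out)
{
    out[0] = sum_squares_int_ctrl();
}
```

To check the result, one option is to compile to a cubin with `nvcc -cubin` and disassemble it with `cuobjdump -sass`; a fully unrolled loop should show up as straight-line code without a backward branch.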