I suspect it is due to the size of the PTX file produced from the code. The largest PTX file I could produce was 3800 lines, and I believe there is some kind of limit. Is there one?
That is actually far below the limit mentioned in the CUDA FAQ: “The maximum kernel size is 2 million PTX instructions”.
OK, I moved the code that I “unroll” into an empty project, and it compiled just fine, with all loops totally unrolled, 124 registers, and no local memory used. The PTX file was ~10000 lines. But when I put the unrolled code back into my program, I get this Error code 128 during compilation. It's probably because I run out of registers (nvcc_2.3.pdf mentions a “GPU specific maximum of 128 registers”, which I understand to be a per-thread limit). So the task now is to move some rarely used data from registers to local memory.
Edit: Hey, wait a minute, why can’t the compiler keep using local memory instead of registers (as it did before I added unrolling) when there are not enough registers? I believe I read somewhere that this is exactly how it should behave.
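
For anyone who wants to experiment with this, here is a minimal sketch of the two knobs I know of for trading registers against local memory: a partial `#pragma unroll` factor instead of full unrolling, and nvcc's `-maxrregcount` option, which caps registers per thread and makes the compiler spill the rest to local memory. The kernel `accumulate`, the constant `TAPS`, and the file name `unroll.cu` are made up for illustration and are not from my actual program. I can't say whether this avoids the Error code 128 above.

```cuda
// Build (assumed file name unroll.cu); -Xptxas -v prints register and
// local-memory usage per kernel, -maxrregcount caps registers per thread:
//   nvcc -Xptxas -v -maxrregcount=64 unroll.cu -o unroll
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TAPS 64  // hypothetical trip count of the loop being unrolled

__global__ void accumulate(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Partial unrolling: expand 4 iterations at a time instead of all 64.
    // A plain "#pragma unroll" would request full unrolling.
    #pragma unroll 4
    for (int k = 0; k < TAPS; ++k) {
        acc += in[(i + k) % n] * (float)(k + 1);
    }
    out[i] = acc;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // Host buffers with simple input data.
    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Device buffers.
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    accumulate<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The copy back also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return 0;
}
```

Comparing the `-Xptxas -v` output with and without `-maxrregcount`, and with different unroll factors, should show whether data is moving from registers into local memory (reported as lmem).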