I can confirm what you are seeing but I can’t explain why. The optimization flags are making no difference to the PTX the compiler is generating, which is really weird…
Yes, it’s really strange… Maybe it’s due to gcc…
I tried unrolling the first iteration of the loop by hand, and then the problem goes away, but if the loop is more complicated this workaround doesn’t help :(
Can nobody explain to us what’s happening?
You will need to include cuda_runtime.h.
You cannot use the <<< >>> syntax in a .c file, so you will need to move all the kernel launches into the .cu file or use the driver API.
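A minimal sketch of that arrangement (the kernel name `scale` and the wrapper `launch_scale` are made up for illustration): the kernel and its launch go in the .cu file compiled by nvcc, and the .c file only calls a plain C wrapper.

```cuda
// kernel.cu -- compiled by nvcc; the <<< >>> launch syntax lives here
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// extern "C" so the symbol links cleanly against code gcc compiled as C
extern "C" void launch_scale(float *d_data, float factor, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, factor, n);
}
```

From the .c side you would then just declare `void launch_scale(float *, float, int);` and call it like any other function.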
Ok, now it works, but… I can’t make this kind of change in my real program!!
If it works now, is the problem in nvcc? Hmmm, the program is now quite different. Maybe the compiler does a different optimization… I’m using a wrapper function with the kernel launch in the .cu file.
I confirm the problem on Debian x64 with gcc 4.3.4 and CUDA Toolkit 3.0 beta.
When the kernel is launched for the first time, cudaSetupArgument ends up being called with some ridiculous value in the ‘offset’ argument (0x7fffffffe1d8), and the launch fails with the error “invalid argument” (please check the return values!).
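To actually see failures like that “invalid argument”, every runtime call and kernel launch needs its return status checked. A common pattern (the macro name `CUDA_CHECK` is my own, not from the toolkit) looks like this:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with file/line and the CUDA error string if a runtime call fails
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// A <<< >>> launch returns void, so query the error state afterwards:
//     my_kernel<<<blocks, threads>>>(args);
//     CUDA_CHECK(cudaGetLastError());
```

With this in place, a failed launch prints the error immediately instead of silently producing garbage later.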
Looks like a problem between some recent gcc optimizations and the CUDA runtime. Maybe an ABI mismatch, or just a gcc bug?