the program seems to exit instantly, although I would expect an infinite loop on the device, with the host waiting for it. Even adding printf("%s\n", cudaGetErrorString(cudaGetLastError())); at the end of the host code reports no error. Does anyone have an explanation for this?
I haven’t tried compiling this myself to be sure, but have you looked at the PTX for this kernel? The compiler is very good at removing dead code, by which I essentially mean code that does not contribute to a write to global memory. In that sense, this entire kernel is “dead”, so I wouldn’t be too surprised if the compiler was just turning the whole thing into a no-op…
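To make the dead-code point concrete, here is a rough sketch rather than the original poster's kernel. The names spin_dead, spin_until_set and run_live_kernel are made up for illustration, and whether the first loop is actually removed depends on the compiler version and flags, so treat it as something to check in the PTX rather than a guarantee. Error checks are left out to keep it short.

```cuda
#include <cuda_runtime.h>

// Hypothetical reconstruction: nothing computed here is ever stored, so the
// compiler may treat the loop as dead code. It is not launched below, since
// it would hang if the loop happened to survive compilation.
__global__ void spin_dead() {
    int i = 0;
    while (true) { ++i; }
}

// Hypothetical contrast: every iteration does a volatile read of global
// memory and the count is stored afterwards, so the loop has to be kept.
__global__ void spin_until_set(volatile int *flag, int *iterations) {
    int n = 0;
    while (*flag == 0) { ++n; }
    *iterations = n;
}

// Runs spin_until_set with the flag already set to 1, so the demo terminates
// and the stored iteration count is 0.
int run_live_kernel() {
    int one = 1, h_iters = -1;
    int *d_flag = nullptr, *d_iters = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_iters, sizeof(int));
    cudaMemcpy(d_flag, &one, sizeof(int), cudaMemcpyHostToDevice);
    spin_until_set<<<1, 1>>>(d_flag, d_iters);
    // A synchronous cudaMemcpy waits for the preceding kernel to finish.
    cudaMemcpy(&h_iters, d_iters, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    cudaFree(d_iters);
    return h_iters;
}
```

Assuming the file is saved as spin.cu (a placeholder name), nvcc -ptx spin.cu writes the PTX to spin.ptx, and comparing the bodies of the two kernels there shows whether the first loop was kept.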
And of course it’s also possible that the answer is even simpler: one of these functions could be returning a CUDA error that you’re not catching. Check the return value of every call (at least in debug mode) to be sure.
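One common way to do this is to wrap every runtime call in a checking macro. The sketch below assumes a helper named CUDA_CHECK and a toy kernel write_index, neither of which comes from the original code. Since kernel launches are asynchronous, it checks cudaGetLastError() right after the launch for launch errors and then the result of cudaDeviceSynchronize() for errors raised while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing location and abort on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Toy kernel (illustrative only): each thread stores its own index.
__global__ void write_index(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

// Launches the toy kernel with every call checked; returns the last element.
int checked_launch() {
    int *d_out = nullptr;
    int h_out[32];
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(h_out)));
    write_index<<<1, 32>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution
    CUDA_CHECK(cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));
    return h_out[31];
}
```

Aborting on the first error is just one choice; returning the cudaError_t to the caller would work equally well if the program needs to recover.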