I assume you launch the kernel with a blockDim of 256 and an array of 256 elements for f. In that case your commented-out statements will index past the end of the allocated array: k can reach 257, because it is initially limited to 255 and then incremented twice. Writing f[256] is already out of bounds and would cause an error.
Further, for some reason ptxas fails when compiling for a target below compute capability 2.0 in release mode, but it does compile in (Nsight) debug mode (-G0).
I haven't figured that one out yet, but it should be reported as a bug.
After a few changes the kernel produces output, too! (compiled in debug mode)
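For anyone following along, here is a minimal sketch of the kind of bounds guard I mean. It is not the original kernel: the name fillGuarded, the parameter n, and the "k += 2" standing in for the two increments are all assumptions made up for illustration, and it assumes a recent toolkit (cudaDeviceSynchronize, nullptr).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, NOT the one from the original post.
// It only illustrates guarding an index that has been pushed past
// the thread index before it is used to write into f.
__global__ void fillGuarded(float *f, int n)
{
    int k = threadIdx.x;   // 0..255 when blockDim.x == 256
    k += 2;                // stands in for the two increments; can reach 257

    // Without this check, threads 254 and 255 would write f[256] and f[257],
    // which lie outside a 256-element allocation.
    if (k < n) {
        f[k] = 1.0f;
    }
}

int main()
{
    const int n = 256;     // assumed array length, matching blockDim.x
    float *d_f = nullptr;

    cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_f, 0, n * sizeof(float));

    fillGuarded<<<1, 256>>>(d_f, n);

    // Report any launch or execution error from the kernel.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_f);
    return 0;
}
```

Passing the array length as a parameter keeps the guard correct if the launch configuration changes later. Note that this only addresses the out-of-bounds write, not the ptxas failure.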
You are quite right that the bounds problem is not likely to be the cause of the memory-access fault at compile time (ptxas.exe).
As I mentioned in my (edited) post, the kernel compiles with the commented-out code enabled when targeting SM20 (Fermi, release and debug) or SM13 (200 series, debug only), but ptxas hits an access fault in other configurations. I have tried numerous variations of the code (after taking care of the bounds problem) without success.
So I think there is a bug in the compiler, and the problem should be reported to NVIDIA.
Thank you for bringing this issue to our attention. I am able to reproduce this problem with the CUDA 3.1 toolchain on WinXP64 plus VS2005. I am unable to reproduce this issue with a recent internal toolchain, so it looks like the problem may already be fixed. I will follow up with our compiler team.