While I greatly appreciate the coding 101 lesson, indeed I have, as I always do, run every check and debug tool under the sun, and ALL CUDA API calls in the code are checked for their return status. cuda-gdb has been used to meticulously step through every line of code, and cuda-memcheck et al. have all been run. The same is true of the host code with gdb, Valgrind et al. There are ZERO errors, memory leaks, etc. I hope that puts an end to any further lessons on basic software development.
I’ll try to be a little more specific so I don’t sound so much like I have no idea what I’m doing.
Regarding the “will fail with no errors” comment: only when running on a 1070, if I compile for a compute capability > 5.3, not one single line of code in the kernel in question is executed, yet a CUDA error check after the kernel launch always returns success. The same is true if I don’t compile with PTX embedded.
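To be concrete about what I mean by “a check after the kernel launch”, here is a minimal sketch of the pattern (the kernel and names are placeholders, not my actual code); it separates launch-time errors, which `cudaGetLastError()` reports immediately after the launch, from execution-time errors, which only surface after a synchronize:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *out) {
    out[threadIdx.x] = threadIdx.x;  // trivially observable side effect
}

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));

    dummyKernel<<<1, 32>>>(d_out);

    // Check 1: launch-time errors (e.g. cudaErrorInvalidDeviceFunction
    // when the binary contains no code for this GPU's architecture
    // and no PTX to JIT from).
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(launchErr));

    // Check 2: execution-time errors, surfaced by synchronizing.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
        printf("sync error: %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_out);
    return 0;
}
```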
The kernel in question runs an image processing filter. The filter runs perfectly on a 780M (Compute Capability 3.0) and a Titan X (Compute Capability 5.2).
Regarding point 2, I’m not presuming, or asking anyone if they think, there is a bug in the compiler or driver. I’m trying to figure out conceptually what the problem might be.
With regard to hardware/architecture compatibility, a 6.1 device has at least the same resources as a 780 or a Titan X, or more: more shared memory, more registers per thread, etc., and the 1070 in question has more global memory (which is irrelevant, since the kernel uses a constant hundred or so MB regardless of image size). The number of registers used, shared memory per block, etc. are all within the stated limits.
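For reference, the per-device limits I’m comparing against can be read straight from the runtime rather than from spec sheets (a minimal sketch, assuming device 0 is the GPU in question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    printf("name: %s (CC %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:  %d\n", prop.regsPerBlock);
    printf("total global memory:  %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```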