So this is a weird problem:
A kernel in an application launches many 1024-thread blocks, and each thread block uses 32,032 bytes of shared memory.
Usually I do not launch kernels with that many threads per block or with that much shared memory, but this time it seemed necessary.
When I set the Max Used Register value (the Visual Studio project setting that maps to nvcc's -maxrregcount) to any number other than zero, that kernel seems not to launch at all, but all the other kernels in the application launch successfully.
No error messages or warnings appear at all (I check every device operation for errors in the typical manner), and the timer I set to time that kernel says the elapsed time is 0.0.
When I set the Max Used Register value to 0 and compile again, that kernel does launch and returns what seems to be a reasonable answer. In that case it takes about 2.8 seconds to finish (it is a very large memory-bound kernel).
I am not sure what this is all about, and I will probably redesign the kernel, but I am curious about what may be going on.
Could this be a resource issue? The compilation output (with max register set to 0) for that specific kernel looks like this:
1> 176 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 62 registers, 32032 bytes smem, 372 bytes cmem
GTX Titan X, Win7 x64, Visual Studio 2012 compiler, CUDA 6.5
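
As a rough check on the resource question, here is a minimal host-side sketch of the register arithmetic for a 1024-thread block. The limits are assumptions for a compute capability 5.2 device such as the GTX Titan X (65,536 registers available to one block, registers allocated per warp in units of 256); the real values should be read from cudaDeviceProp. The helper names `registers_per_block` and `block_fits` are hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed limits for a compute capability 5.2 device (e.g. GTX Titan X).
// These are assumptions, not values queried from the hardware.
constexpr std::uint32_t kRegsPerBlockLimit = 65536; // 32-bit registers one block may use
constexpr std::uint32_t kWarpSize          = 32;    // threads per warp
constexpr std::uint32_t kRegAllocUnit      = 256;   // assumed per-warp allocation granularity

// Registers a block occupies after each warp's allocation is rounded up
// to a multiple of kRegAllocUnit.
std::uint32_t registers_per_block(std::uint32_t regs_per_thread,
                                  std::uint32_t threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    const std::uint32_t warps    = (threads_per_block + kWarpSize - 1) / kWarpSize;
    const std::uint32_t per_warp = regs_per_thread * kWarpSize;
    const std::uint32_t rounded  =
        (per_warp + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    return warps * rounded;
}

// True if a single block stays within the assumed per-block register limit.
bool block_fits(std::uint32_t regs_per_thread, std::uint32_t threads_per_block) {
    return registers_per_block(regs_per_thread, threads_per_block) <= kRegsPerBlockLimit;
}
```

Under these assumptions, `block_fits(62, 1024)` holds because 62 registers per thread rounds up to exactly 65,536 registers for the block, while `block_fits(65, 1024)` does not. So a 1024-thread block has no headroom beyond 64 registers per thread, and a build that ends up with more than that could plausibly fail at launch. A launch that exceeds the limit would normally be reported through cudaGetLastError() right after the kernel call rather than pass silently.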