Here’s a bizarre one…
Shared memory atomicAdd() calls that discard the return value and sit at the very end of a function body appear to be skipped.
Single-stepping in Nsight shows no update occurring.
Assigning (and then ignoring) the return value resolves the problem while debugging.
My environment is CUDA 7.5 RC + VS2013 + Debug/Win7 x64/driver 353.45. I’m targeting compute_50/compute_50.
It took a couple of hours to find this.
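For anyone who wants to poke at it, here's a minimal sketch of the pattern. This is not my actual code: the helper names `bump_discard` and `bump_capture`, the kernel name `repro`, and the launch configuration are all made up for illustration, and I can't promise this exact reduction triggers the behavior on other setups. It should be built as a debug build (e.g. with nvcc's `-G`) to match what I was seeing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the shared memory atomicAdd() is the last statement
// in the function body and its return value is discarded. This is the
// pattern that appeared to be skipped.
__device__ void bump_discard(unsigned int* counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical workaround variant: assign the return value, then ignore it.
__device__ void bump_capture(unsigned int* counter)
{
    unsigned int old = atomicAdd(counter, 1u);
    (void)old;
}

__global__ void repro(unsigned int* out)
{
    __shared__ unsigned int counts[2];
    if (threadIdx.x == 0) { counts[0] = 0; counts[1] = 0; }
    __syncthreads();

    bump_discard(&counts[0]);   // every thread adds 1
    bump_capture(&counts[1]);   // every thread adds 1
    __syncthreads();

    // Copy the shared counters out so the host can inspect them.
    if (threadIdx.x == 0) { out[0] = counts[0]; out[1] = counts[1]; }
}

int main()
{
    const unsigned int threads = 64;
    unsigned int* d_out = nullptr;
    unsigned int h_out[2] = {0, 0};

    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return 1;
    repro<<<1, threads>>>(d_out);
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_out);

    printf("discarded: %u  captured: %u  threads: %u\n", h_out[0], h_out[1], threads);
    // Both counters should equal the thread count if neither atomicAdd() is skipped.
    return (h_out[0] == threads && h_out[1] == threads) ? 0 : 1;
}
```

If both counters match the thread count, the atomics ran. A mismatch on the first counter only would line up with what I saw while single-stepping.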