I am having problems with the example in chapter 9. The example for shared memory atomics (hist_gpu_shmem_atomics.cu) is supposed to execute much faster than the global memory example (hist_gpu_gmem_atomics.cu), but in my case both take 300-400 ms. Any ideas on how to find the problem? Thanks
Atomic performance depends quite a bit on the GPU you are running on.
CUDA by Example was written in the Fermi timeframe. Kepler, for example, introduced much faster global atomics.
“As we can see from the Kepler performance plots, the global atomics perform better than shared in most cases”
Maxwell, OTOH, introduced improved shared atomic performance. So the GPU you are running on matters.
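To see where your card falls, one option is to print its compute capability and time the two approaches in isolation. The following is only a minimal sketch, not the book's code: the kernel names `histoGlobal` and `histoShared`, the helper `archName`, the 100 MB random input and the launch configuration are all assumptions made for illustration. It times only the kernel launches, so host-to-device copies are excluded from the measurement.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 256  // one bin per byte value; both kernels assume blockDim.x == NUM_BINS

// Every thread increments the global histogram directly with atomics.
__global__ void histoGlobal(const unsigned char *data, size_t n, unsigned int *histo) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&histo[data[i]], 1u);
}

// Each block accumulates into a shared-memory histogram, then merges it
// into the global histogram once at the end.
__global__ void histoShared(const unsigned char *data, size_t n, unsigned int *histo) {
    __shared__ unsigned int local[NUM_BINS];
    local[threadIdx.x] = 0;
    __syncthreads();
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();
    atomicAdd(&histo[threadIdx.x], local[threadIdx.x]);
}

// Hypothetical helper: maps the compute capability major version to an
// architecture family. Only the unambiguous older families are listed.
const char *archName(int major) {
    switch (major) {
        case 2: return "Fermi";
        case 3: return "Kepler";
        case 5: return "Maxwell";
        case 6: return "Pascal";
        default: return "other";
    }
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("%s: compute capability %d.%d (%s)\n",
           prop.name, prop.major, prop.minor, archName(prop.major));

    // Assumed test input: 100 MB of random bytes.
    const size_t n = (size_t)100 * 1024 * 1024;
    unsigned char *h_data = (unsigned char *)malloc(n);
    if (!h_data) return 1;
    for (size_t i = 0; i < n; ++i) h_data[i] = (unsigned char)(rand() & 0xFF);

    unsigned char *d_data = NULL;
    unsigned int *d_histo = NULL;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);

    int blocks = 2 * prop.multiProcessorCount;  // assumed launch configuration
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msGlobal = 0.0f, msShared = 0.0f;

    // Time the global-atomics kernel only (no transfers inside the events).
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));
    cudaEventRecord(start);
    histoGlobal<<<blocks, NUM_BINS>>>(d_data, n, d_histo);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msGlobal, start, stop);

    // Time the shared-atomics kernel the same way.
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));
    cudaEventRecord(start);
    histoShared<<<blocks, NUM_BINS>>>(d_data, n, d_histo);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msShared, start, stop);

    printf("global atomics: %.2f ms\nshared atomics: %.2f ms\n", msGlobal, msShared);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFree(d_histo);
    free(h_data);
    return 0;
}
```

Build it with something like `nvcc -O3 hist_compare.cu -o hist_compare` (the file name is arbitrary). If the two kernel-only times are close on your card, that fits the architecture explanation above rather than pointing to a bug in your build.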