What could cause a kernel to take 300x longer from one call to another?

I have a very simple kernel that reads a few values and uses atomicAdd to accumulate a value into a particular address of a managed array, which should always be resident on the device. (Which array index gets written is determined by one of the values read from device memory.) This kernel accounts for an exorbitant share (~50%) of my program's runtime even though it makes up less than 1% of the total kernel calls: its minimum call time is 600 microseconds, its average is 156 milliseconds, and its maximum is 163 milliseconds. The 600-microsecond figure is believable; I have no idea why the same kernel would sometimes take 300 times longer. Each execution is identical - it reads and writes data through the exact same pointers.
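
For concreteness, the kernel is essentially of this shape (a minimal sketch, not my actual code; scatterAdd, the array names, and the launch configuration are all placeholders):

```
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the kernel shape described above: each thread reads a key and a
// value from device memory and atomically accumulates the value into the
// slot of a managed array selected by that key.
__global__ void scatterAdd(const int *keys, const float *values,
                           float *managedOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The destination index is itself read from device memory,
        // so the write pattern is data-dependent.
        atomicAdd(&managedOut[keys[i]], values[i]);
    }
}

int main()
{
    const int n = 1 << 20, bins = 256;
    int *keys; float *values, *out;
    cudaMallocManaged(&keys, n * sizeof(int));
    cudaMallocManaged(&values, n * sizeof(float));
    cudaMallocManaged(&out, bins * sizeof(float));
    for (int i = 0; i < n; ++i) { keys[i] = i % bins; values[i] = 1.0f; }
    cudaMemset(out, 0, bins * sizeof(float));

    scatterAdd<<<(n + 255) / 256, 256>>>(keys, values, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect n / bins = 4096

    cudaFree(keys); cudaFree(values); cudaFree(out);
    return 0;
}
```

The "should always be on the device" point matters because the output array is managed: if anything on the host touched it between launches, the next launch could stall on page migration, which is the kind of thing that might make identical calls differ wildly in duration.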

Okay, I happened to just discover something strange: this only happens with --unified-memory-profiling off. I had kept this flag because nvprof wouldn't work on CUDA 8 with unified-memory profiling enabled, but hopefully that bug has since been resolved. I am still quite concerned about how turning off unified-memory profiling could cause the problem described above. The average execution time for this kernel is indeed now under 700 microseconds.
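
To rule out the profiler itself skewing the numbers, the kernel can also be timed directly with CUDA events, independent of nvprof (a sketch that wraps the scatterAdd launch from above; timedLaunch is an illustrative helper, not an existing API):

```
// Time one launch of the scatterAdd sketch above with CUDA events, so the
// measurement does not depend on nvprof or its unified-memory-profiling flag.
float timedLaunch(const int *keys, const float *values, float *out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scatterAdd<<<(n + 255) / 256, 256>>>(keys, values, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Logging this per call should show whether the 150+ ms outliers are real GPU time or an artifact of how the profiler instruments unified memory.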

Never mind - I was not able to replicate that behavior (using the exact same executable)… the plot thickens.