I have two different kernels that I would like to benchmark. In my setup, I control a CUDA-enabled COM server through MATLAB. As a first quick benchmark, I executed each kernel 1000 times and called ::cudaDeviceSynchronize() afterwards to ensure that all work has actually finished by the time control returns to the host. Measured this way, kernel 2 is twice as fast as kernel 1. Nice, my optimizations seem to have worked. I had observed this speedup in the context of a complete signal processing chain as well, where successive kernel functions depend on the data produced by the previous ones, so these MATLAB measurements should be valid.
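For reference, a batch measurement of this kind can be sketched with CUDA events instead of a host-side timer. This is only a minimal sketch under assumptions: placeholderKernel, averageKernelMs, the problem size, and the launch configuration are hypothetical stand-ins, and the COM server and MATLAB layers are left out entirely.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real kernels,
// which are not shown in this post.
__global__ void placeholderKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Launches the kernel `runs` times back to back and returns the average
// elapsed time per launch in milliseconds, or -1.0f if a CUDA call failed.
float averageKernelMs(int n, int runs)
{
    float *d_data = 0;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    const int threads = 256;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not part of the timing.
    placeholderKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        placeholderKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until every launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ok ? ms / runs : -1.0f;
}

int main()
{
    float avg = averageKernelMs(1 << 20, 1000);
    printf("average time per launch: %f ms\n", avg);
    return avg >= 0.0f ? 0 : 1;
}
```

Note that the events bracket the whole batch, so the average covers everything that happens between the first and the last launch, including any gaps between launches, and is not necessarily the same quantity as a per-kernel duration.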
But then I wanted to look at the other metrics in the profiler, and the timings reported there do not match: apparently both kernels run at approximately the same speed. I am really confused right now. Does anyone have an idea what might be going wrong here?
Some more details on my setup: 780 Ti, CUDA 7. The supposedly faster kernel 2 makes heavy use of atomic functions and does not have a high degree of parallelism, but its memory transactions are coalesced, whereas the memory accesses in kernel 1 were pretty random.
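To illustrate the two access patterns, here is a rough sketch. These are not the actual kernels: scatteredGather, coalescedAtomic, runAccessPatternDemo, the index pattern, and the bin count are all assumptions made up for this example. The first kernel reads through an arbitrary index array (uncoalesced), the second reads consecutive elements (coalesced) but accumulates into a small number of bins with atomicAdd.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel-1-like pattern (hypothetical): each thread reads from an arbitrary
// location, so the loads of a warp are scattered across memory.
__global__ void scatteredGather(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Kernel-2-like pattern (hypothetical): consecutive threads read consecutive
// elements, but all blocks contend for the same few bins via atomicAdd.
__global__ void coalescedAtomic(const float *in, float *bins, int n, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[i % numBins], in[i]);
}

// Runs both kernels on small test data and checks the results on the host.
bool runAccessPatternDemo()
{
    const int n = 1 << 16, numBins = 32, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f), h_bins(numBins, 0.0f);
    std::vector<int> h_idx(n);
    // Multiplying by an odd constant modulo a power of two permutes the indices.
    for (int i = 0; i < n; ++i) h_idx[i] = (i * 7919) % n;

    float *d_in = 0, *d_out = 0, *d_bins = 0;
    int *d_idx = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_bins, numBins * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in, &h_in[0], n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, &h_idx[0], n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, numBins * sizeof(float));

    scatteredGather<<<blocks, threads>>>(d_in, d_idx, d_out, n);
    coalescedAtomic<<<blocks, threads>>>(d_in, d_bins, n, numBins);

    // These blocking copies on the default stream wait for both kernels.
    cudaMemcpy(&h_out[0], d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_bins[0], d_bins, numBins * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Every input is 1.0f, so each gathered value is 1.0f and each bin
    // should hold exactly n / numBins.
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == 1.0f);
    for (int b = 0; b < numBins; ++b) ok = ok && (h_bins[b] == float(n / numBins));

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_bins); cudaFree(d_idx);
    return ok;
}

int main()
{
    bool ok = runAccessPatternDemo();
    printf("access pattern demo %s\n", ok ? "passed" : "failed");
    return ok ? 0 : 1;
}
```

The sketch only checks correctness; how the two patterns compare in speed depends on the real data sizes and on how much the atomics contend.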