Why does my kernel take too long occasionally?

Here is what I found so far:

Having one kernel or two kernels in the loop does not make any difference in my case. There are still spikes at random times. However, when I remove the data transfer (I am using pinned host memory and each thread copies one float), it is much, much better. I do not see big spikes in 1,000,000 loops. I tried 5,000,000 loops and I do see one spike happening exactly at loop number 1,611,392, which is a good thing, and I believe this is some other issue. At least the randomness is gone when I remove the memcpy. I will focus on this loop number, but you are correct that this definitely has something to do with the memcpy between host and GPU.
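For anyone who wants to reproduce this kind of measurement, here is a minimal host-side sketch of how the per-loop times could be recorded and the spikes picked out. It is not my actual code: `time_iterations`, `find_spikes`, and the "more than `factor` times the median" rule are hypothetical names and choices made for illustration. In a real CUDA test the loop body would launch the kernel(s), do the memcpy, and synchronize before returning, so that the host timer covers the GPU work.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: time each call of body(i) with a steady clock and
// return the per-iteration durations in microseconds.
// In a CUDA test, body(i) would launch the kernel, copy the data, and
// synchronize with the device before returning (assumption, not shown here).
template <typename Body>
std::vector<double> time_iterations(std::size_t n, Body body) {
    std::vector<double> durations_us;
    durations_us.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        body(i);
        const auto t1 = std::chrono::steady_clock::now();
        durations_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return durations_us;
}

// Hypothetical helper: return the indices of iterations whose duration is
// more than `factor` times the median duration. The threshold rule is an
// illustrative choice, not a standard definition of a "spike".
std::vector<std::size_t> find_spikes(const std::vector<double>& durations_us,
                                     double factor) {
    assert(factor > 0.0);
    std::vector<std::size_t> spikes;
    if (durations_us.empty()) return spikes;

    // Compute the (upper) median on a copy so the input order is preserved.
    std::vector<double> sorted(durations_us);
    const std::size_t mid = sorted.size() / 2;
    std::nth_element(sorted.begin(), sorted.begin() + mid, sorted.end());
    const double median = sorted[mid];

    for (std::size_t i = 0; i < durations_us.size(); ++i) {
        if (durations_us[i] > factor * median) spikes.push_back(i);
    }
    return spikes;
}
```

Running the same loop once with the memcpy and once without, then comparing the indices returned by `find_spikes`, shows whether a spike lands on the same loop number every time or moves around between runs.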
