Having one kernel or two kernels in the loop makes no difference in my case; there are still spikes at random times. However, when I remove the data transfer (I am using pinned host memory and each thread copies one float), it is much, much better: I do not see any big spikes in 1,000,000 loops. I then tried 5,000,000 loops and I do see one spike that happens exactly at loop number 1,611,392, which is a good thing, and I believe this is some other issue. At least the randomness is gone when I remove the memcpy. I will focus on this loop number, but you are correct that this definitely has something to do with the memcpy between host and GPU.
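A minimal sketch of this kind of measurement might look like the following. It is not the actual test code: the kernel `touch`, the buffer size, the 500 µs spike threshold and the command-line switch are all assumptions made for illustration, and it uses a plain `cudaMemcpy` from pinned memory, which may differ from the one-float-per-thread copy described above.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel: each thread updates one float.
__global__ void touch(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(int argc, char** argv) {
    const bool with_copy = (argc > 1);   // any argument enables the memcpy
    const int n = 256;                   // assumed buffer size
    const size_t bytes = n * sizeof(float);
    const long iters = 1000000;          // loop count from the post
    const double spike_us = 500.0;       // assumed threshold for a "spike"

    float* h = nullptr;
    float* d = nullptr;
    CHECK(cudaMallocHost(&h, bytes));    // pinned (page-locked) host memory
    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaMemset(d, 0, bytes));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    // Warm-up launch so one-time initialization cost is not counted.
    touch<<<1, n>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    double worst_us = 0.0;
    long worst_it = -1;
    for (long it = 0; it < iters; ++it) {
        auto t0 = std::chrono::steady_clock::now();
        if (with_copy) {
            CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        }
        touch<<<1, n>>>(d, n);
        CHECK(cudaGetLastError());
        CHECK(cudaDeviceSynchronize());  // wait so the host timer covers the kernel
        auto t1 = std::chrono::steady_clock::now();

        double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > spike_us) printf("spike at loop %ld: %.1f us\n", it, us);
        if (us > worst_us) { worst_us = us; worst_it = it; }
    }
    printf("worst loop: %ld (%.1f us), memcpy %s\n", worst_it, worst_us,
           with_copy ? "on" : "off");

    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Running it once without arguments and once with any argument gives the two cases being compared (no transfer versus transfer in the loop). Printing the loop index of each spike makes it easy to see whether the spikes move between runs or always land on the same iteration.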