Why does my kernel take too long occasionally?

Hey guys. I have a piece of code something like this:

for (n = 0; n < N; n++)
{
    clock_gettime(CLOCK_REALTIME, &startTime);
    gpuMultKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrInB, dg_fvG, dg_plG);
    gpuAccKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrOutB, dg_cpNoiseB);
    cudaThreadSynchronize();
    clock_gettime(CLOCK_REALTIME, &endTime);
}

I measure the elapsed time for each iteration from startTime and endTime. On average one loop takes about 40 us, but every once in a while (4-5 times in 10,000 loops) I see it take over 1 ms. Does anybody have any idea why this is happening? When it happens I fall way behind real time…
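In case it matters, the per-loop time is computed from the two timespec values along these lines (a simplified sketch; elapsedUs is just a local variable, the other names are as above):

    /* difference of the two clock_gettime() samples, in microseconds */
    double elapsedUs = (endTime.tv_sec  - startTime.tv_sec)  * 1.0e6 +
                       (endTime.tv_nsec - startTime.tv_nsec) / 1.0e3;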

This happens on both a GTX 285 and a Tesla C2050, neither of which is used for the display. The development platform is Linux.

I really appreciate any insights and suggestions.

Thanks…

Can you measure the kernels separately? Also try simplifying your kernels to see whether it is an OS issue or a kernel-code issue. It may just be a long computation if the code is complex. It would also be worth trying a different OS.

hmm… I know that in benchmarking it is usual to do a warm-up round and a lot of runs to take an average. That points to either a lot of variability or a desire for academic accuracy…

Your call to decide. I don't know why variation would occur in the absence of other tasks; maybe it is CPU-side. Try putting a cudaThreadSynchronize() between the two kernel calls, if only to see whether that makes the timings more stable. Then also try to measure the GPU time.
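Something along these lines (a rough sketch, kernel names and arguments taken from your snippet) would give you the GPU-side time with CUDA events:

    cudaEvent_t evStart, evStop;
    float gpuMs = 0.0f;

    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    cudaEventRecord(evStart, 0);
    gpuMultKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrInB, dg_fvG, dg_plG);
    gpuAccKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrOutB, dg_cpNoiseB);
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);

    cudaEventElapsedTime(&gpuMs, evStart, evStop);  /* elapsed GPU time in milliseconds */

    cudaEventDestroy(evStart);
    cudaEventDestroy(evStop);

If the GPU-side number stays flat while the CPU-side number spikes, the problem is on the host (scheduling, driver, launch overhead) rather than in the kernels themselves.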

Thanks for the responses, guys.

I measured the kernels separately and the behavior is somewhat random. Sometimes both kernels take a long time and sometimes it's just the second kernel. My kernels are really simple and the sizes of the input vectors are constant for every loop. With 10,000 loops, the spike sometimes hits at loop #1000 and sometimes at loop #5000. If I can get things installed on Windows I can try that, but unfortunately I don't have that luxury at the moment.

I tried cudaThreadSynchronize() between the kernels and I see the same thing, and since it adds another ~10 us of processing time I normally leave it out. I really don't think it's a warm-up issue, since it can happen at, say, loop #5000 out of 10,000. The average is fine; it's just the random occurrences of one or two long iterations that mess up the real-time processing.

I am having what could be a related issue (see http://forums.nvidia.com/index.php?showtopic=182331), but have not yet found a solution. I reduced the problem to simply timing cudaMemcpy from device to host in a loop, and I get timing spikes at one-second intervals. I'm wondering whether your timing spikes occur at regular intervals. Do you see similar behavior if you run my test case from the link above?
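Roughly, the shape of the test is the following (placeholder names, not the exact code from the linked thread):

    /* time a fixed-size device-to-host copy on every iteration and
       look for outliers afterwards */
    for (i = 0; i < numIters; i++)
    {
        clock_gettime(CLOCK_REALTIME, &t0);
        cudaMemcpy(hostBuf, devBuf, numBytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_REALTIME, &t1);
        /* record (t1 - t0) for this iteration */
    }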

Just to keep the investigation going, I’ve been experiencing this exact problem for almost 2 years now in CUDA (and made many threads about it a while back).

I never got to the bottom of it, but it seems to be related to some kind of unreliable scheduling or memory-transfer mechanism. It is far more prominent when you run many different kernels and rarely happens when you simply run the same kernel repeatedly… which is a pain, because our system currently runs 60+ different kernels thousands of times a second.

I was able to achieve more reliable timing (i.e., no timing spikes at regular intervals) by setting the CPU affinity via either sched_setaffinity() or numactl. I'm running on a dual six-core motherboard. My guess is that some cores are servicing interrupts (from what, I don't know yet) or running some other Linux kernel-space code at regular intervals. In my case I was getting timing stalls on the order of 13-14 ms.

This thread (http://forums.nvidia.com/index.php?showtopic=104243) helped me find a solution.
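The pinning itself is only a few lines; here is a rough sketch (core 2 is just an example, pick one that isn't busy servicing interrupts on your box):

    #define _GNU_SOURCE     /* needed for sched_setaffinity() */
    #include <sched.h>
    #include <stdio.h>

    /* Early in main(), before the timing loop: pin the process to one
       core so the scheduler cannot migrate it onto a core that is busy
       with interrupts or other kernel-space work. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

From the command line, numactl --physcpubind=2 ./yourapp gives the same kind of pinning (and --membind can keep the memory local on a NUMA board).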

Hi daerhu,

To get rid of the overhead introduced by cudaMemcpy, I allocate pinned memory on the host and transfer data within the kernel. You make a good point, though: even though this is page-locked memory, I am wondering whether the OS has something to do with the "spikes". The spikes I am seeing are completely random; if yours are periodic, they may be tied to some other "event" and may be resolvable.

I am not very good with Linux, but I will ask a friend to try this and look at anything else that may be interfering. Thanks for the input…
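For reference, the host buffers are set up roughly like this (a simplified sketch; h_buf, d_buf, and numBytes are just placeholder names), so the kernel reads and writes the host memory directly instead of going through cudaMemcpy:

    unsigned int *h_buf, *d_buf;

    cudaSetDeviceFlags(cudaDeviceMapHost);                         /* must be set before the CUDA context is created */
    cudaHostAlloc((void**)&h_buf, numBytes, cudaHostAllocMapped);  /* pinned, mapped host memory */
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);            /* device-side alias of the same buffer */

    /* d_buf is then passed to the kernels, which access it directly */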

edit

When I get a chance I will remove the second kernel, try it with just one kernel in the loop, and post the results. I will also remove the data transfers between the host and the GPU to see whether the timing is consistent across all loops.
