Why does my kernel take too long occasionally?

Hey guys. I have a piece of code something like this:

for (n = 0; n < N; n++)
{
    clock_gettime(CLOCK_REALTIME, &startTime);
    gpuMultKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrInB, dg_fvG, dg_plG);
    gpuAccKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrOutB, dg_cpNoiseB);
    cudaThreadSynchronize();
    clock_gettime(CLOCK_REALTIME, &endTime);
}

I measure the elapsed time for each iteration from startTime and endTime. On average one loop takes about 40 us, but every once in a while (4-5 times in 10,000 loops) I see it take over 1 ms. Does anybody have any idea why this is happening? When it happens I fall way behind real time…
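In case it matters, the per-loop time is computed from the two timespec values along these lines (a simplified sketch; elapsedUs is just a local variable, the other names are as above):

    /* difference of the two clock_gettime() samples, in microseconds */
    double elapsedUs = (endTime.tv_sec  - startTime.tv_sec)  * 1.0e6 +
                       (endTime.tv_nsec - startTime.tv_nsec) / 1.0e3;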

This happens on both a GTX 285 and a Tesla C2050, neither of which is used for the display. The development platform is Linux.

I really appreciate any insights and suggestions.

Thanks…

Can you measure the kernels separately? Also try simplifying your kernels to see whether it is an OS issue or a kernel-code issue. It may just be a long computation if the code is complex. It would also be worth trying a different OS.

hmm… I know that in benchmarking it is usual to do a warm-up round and a lot of runs to take an average. That points to either a lot of variability or a desire for academic accuracy…

Your call to decide. I don't know why variation would occur in the absence of other tasks; maybe it is CPU-side. Try putting a cudaThreadSynchronize() between the two kernel calls, if only to see whether that makes the timings more stable. Then also try to measure the GPU time.
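Something along these lines (a rough sketch, kernel names and arguments taken from your snippet) would give you the GPU-side time with CUDA events:

    cudaEvent_t evStart, evStop;
    float gpuMs = 0.0f;

    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    cudaEventRecord(evStart, 0);
    gpuMultKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrInB, dg_fvG, dg_plG);
    gpuAccKernel<<<blocksPerGrid, g_ThreadsPerBlock>>>(dg_uiptrOutB, dg_cpNoiseB);
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);

    cudaEventElapsedTime(&gpuMs, evStart, evStop);  /* elapsed GPU time in milliseconds */

    cudaEventDestroy(evStart);
    cudaEventDestroy(evStop);

If the GPU-side number stays flat while the CPU-side number spikes, the problem is on the host (scheduling, driver, launch overhead) rather than in the kernels themselves.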

Thanks for the responses, guys.

I measured the kernels separately and the behavior is somewhat random. Sometimes both kernels take a long time and sometimes it's just the second kernel. My kernels are really simple and the sizes of the input vectors are constant for every loop. With 10,000 loops, the spike sometimes hits at loop #1000 and sometimes at loop #5000. If I can get things installed on Windows I can try that, but unfortunately I don't have that luxury at the moment.

I tried cudaThreadSynchronize() between the kernels and I see the same thing, and since it adds another ~10 us of processing time I normally leave it out. I really don't think it's a warm-up issue, since it can happen at, say, loop #5000 out of 10,000. The average is fine; it's just the random occurrences of one or two long iterations that mess up the real-time processing.

I am having what could be a related issue (see http://forums.nvidia.com/index.php?showtopic=182331), but have not yet found a solution. I reduced the problem to simply timing cudaMemcpy from device to host in a loop, and I get timing spikes at one-second intervals. I'm wondering whether your timing spikes occur at regular intervals. Do you see similar behavior if you run my test case from the link above?
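Roughly, the shape of the test is the following (placeholder names, not the exact code from the linked thread):

    /* time a fixed-size device-to-host copy on every iteration and
       look for outliers afterwards */
    for (i = 0; i < numIters; i++)
    {
        clock_gettime(CLOCK_REALTIME, &t0);
        cudaMemcpy(hostBuf, devBuf, numBytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_REALTIME, &t1);
        /* record (t1 - t0) for this iteration */
    }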

Just to keep the investigation going, I’ve been experiencing this exact problem for almost 2 years now in CUDA (and made many threads about it a while back).

I never got to the bottom of it, but it seems to be related to some kind of unreliable scheduling or memory-transfer mechanism. It is far more prominent when you run many different kernels and rarely happens when you simply run the same kernel repeatedly… which is a pain, because our system currently runs 60+ different kernels thousands of times a second.

I was able to achieve more reliable timing (i.e., no timing spikes at regular intervals) by setting the CPU affinity via either sched_setaffinity() or numactl. I'm running on a dual six-core motherboard. My guess is that some cores are servicing interrupts (from what, I don't know yet) or running some other Linux kernel-space code at regular intervals. In my case I was getting timing stalls on the order of 13-14 ms.

This thread (http://forums.nvidia.com/index.php?showtopic=104243) helped me find a solution.
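The pinning itself is only a few lines; here is a rough sketch (core 2 is just an example, pick one that isn't busy servicing interrupts on your box):

    #define _GNU_SOURCE     /* needed for sched_setaffinity() */
    #include <sched.h>
    #include <stdio.h>

    /* Early in main(), before the timing loop: pin the process to one
       core so the scheduler cannot migrate it onto a core that is busy
       with interrupts or other kernel-space work. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

From the command line, numactl --physcpubind=2 ./yourapp gives the same kind of pinning (and --membind can keep the memory local on a NUMA board).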

Hi daerhu,

To get rid of the overhead introduced by cudaMemcpy, I allocate pinned memory on the host and transfer data within the kernel. You make a good point, though: even though this is page-locked memory, I am wondering whether the OS has something to do with the "spikes". The spikes I am seeing are completely random; if yours are periodic, they may be tied to some other "event" and may be resolvable.

I am not very good with Linux, but I will ask a friend to try this and look at anything else that may be interfering. Thanks for the input…
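For reference, the host buffers are set up roughly like this (a simplified sketch; h_buf, d_buf, and numBytes are just placeholder names), so the kernel reads and writes the host memory directly instead of going through cudaMemcpy:

    unsigned int *h_buf, *d_buf;

    cudaSetDeviceFlags(cudaDeviceMapHost);                         /* must be set before the CUDA context is created */
    cudaHostAlloc((void**)&h_buf, numBytes, cudaHostAllocMapped);  /* pinned, mapped host memory */
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);            /* device-side alias of the same buffer */

    /* d_buf is then passed to the kernels, which access it directly */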

edit

When I get a chance I will remove the second kernel, try it with just one kernel in the loop, and post the results. I will also remove the data transfers between the host and the GPU to see whether the timing is consistent across all loops.
