Kernel randomly fails to launch after several thousand successful launches

traceTest · September 24, 2009, 11:03am

I have a kernel I’m trying to time using this format:

[codebox]

// width and height are the dimensions of InputData and the block is 1 dimensional
gridX = width / block_size + (width % block_size == 0 ? 0 : 1);
gridY = height;
dim3 dimGrid(gridX, gridY);
cudaError_t err;

for (int k = 0; k < numOfTestRuns; k++) {
    start = timeGetTime();
    cuda_kernel<<<dimGrid, block_size>>>(InputData);
    cudaThreadSynchronize();
    end = timeGetTime();
    err = cudaGetLastError();
    Durations[k] = (end - start);
}

[/codebox]

I am varying the size of InputData and testing how the kernel performs with various block sizes. However, I have noticed that every now and then the kernel does not get launched and I get a cudaLaunchError. I don’t know whether it is truly random or whether there is a pattern, but the cause is not obvious to me at the moment. It happens often with a block size of 208.

I find it strange because, with a block size of 208, the kernel may run successfully 21,000 times in the loop (k = 21,000) and then fail to launch on the 21,001st iteration. If I rerun with the same configuration, the next cudaLaunchError may come at around k = 15,000. Isn’t this behaviour very strange? Why would the kernel execute successfully so many times, with no change in its configuration, and then fail to launch at a seemingly random iteration?

Thanks.
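For reference, the grid sizing in the snippet above rounds up, so whenever width is not a multiple of block_size the last block in each row contains threads that lie past the end of the row. A minimal host-side sketch of that arithmetic follows; the function names gridDimX and surplusThreads are made up for illustration, and any width other than the block size of 208 from the post is a hypothetical value.

```cpp
#include <cstddef>

// Ceiling division, written the same way as in the post:
// width / block_size + (width % block_size == 0 ? 0 : 1).
// Assumes blockSize > 0.
std::size_t gridDimX(std::size_t width, std::size_t blockSize) {
    return width / blockSize + (width % blockSize == 0 ? 0 : 1);
}

// Number of threads launched per row whose x index is >= width.
// These threads exist only because the grid is rounded up to whole blocks.
std::size_t surplusThreads(std::size_t width, std::size_t blockSize) {
    return gridDimX(width, blockSize) * blockSize - width;
}
```

For a hypothetical row width of 1000 and a block size of 208, gridDimX gives 5 blocks, so 1040 threads are launched per row and 40 of them have no element to work on. Whether that matters depends on how the kernel computes its index, which the post does not show.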

jjp · September 24, 2009, 12:56pm

Maybe there is some rare race condition in your kernel, or a memory leak in your host code. I’d try to think about what state your program is in at the 21,000th iteration and what changes in that iteration to make the kernel fail. Once the kernel has crashed, it may leave the device in some kind of dirty state that causes subsequent crashes. To be safe, you’d have to reboot.

avidday · September 24, 2009, 2:14pm

The two possibilities that spring to mind are some sort of resource exhaustion or corruption inside the loop, or hardware faults. You can check for the former with something like valgrind or gpuocelot, and for the latter by eliminating the loop inside your program and running the whole program many times with a shell script.

I had a GPU that would run identical code a number of times with perfect results and then randomly fail to run kernels, which I eventually convinced myself was due to something overheating on the board.

traceTest · September 25, 2009, 10:13am

Thanks jjp and avidday for your suggestions.

I’ll try looking into this with your suggestions some time next week and get back to this thread if I find out why, or at least discover other curious behaviour.

Thanks, and have a nice weekend.

MisterAnderson42 · September 25, 2009, 4:10pm

Things to check for first:

Overheating GPU - what is the core temperature of your GPU?
Out-of-bounds memory writes in your kernel - one way to check is to compile in emulation mode and run the program through valgrind on Linux.
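As a rough illustration of the out-of-bounds check, here is a host-side sketch that walks one row the way a 1-D block launch would and counts the indices that fall past the end. It is not the poster's kernel, which is not shown in this thread: the function emulateRow, the increment used as the kernel body, and the guarded flag are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for one row of a launch with 1-D blocks. Each simulated
// thread computes x = blockIdx.x * blockDim.x + threadIdx.x, as a CUDA kernel
// typically would. Returns how many threads would have written past the end
// of the row if the kernel had no bounds check. Assumes blockSize > 0.
std::size_t emulateRow(std::vector<int>& row, std::size_t blockSize, bool guarded) {
    const std::size_t width = row.size();
    const std::size_t gridX = width / blockSize + (width % blockSize == 0 ? 0 : 1);
    std::size_t outOfBounds = 0;
    for (std::size_t block = 0; block < gridX; ++block) {
        for (std::size_t thread = 0; thread < blockSize; ++thread) {
            const std::size_t x = block * blockSize + thread;
            if (x >= width) {
                // A guarded kernel returns here. An unguarded one would write
                // to row[x]; the sketch only counts it instead of writing.
                if (!guarded) ++outOfBounds;
                continue;
            }
            row[x] += 1;  // hypothetical kernel body
        }
    }
    return outOfBounds;
}
```

With a hypothetical row of 1000 elements and a block size of 208, the unguarded variant reports 40 stray indices. On a real device such writes may land in unrelated memory and only occasionally trigger a launch failure, which would be consistent with, though not proof of, the intermittent behaviour described above.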

What GPU are you running this on? Some kernels just inexplicably have this behavior on older hardware: see http://forums.nvidia.com/index.php?showtopic=87803. You’ll find in that thread that the problem seems to go away on GTX 285 and Tesla C1060/S1070/M1060.
