Kernel randomly fails to launch after several thousand successful launches

I have a kernel I’m trying to time using this format:


    // width and height are the dimensions of InputData; the block is one-dimensional
    gridX = width / block_size + (width % block_size == 0 ? 0 : 1);
    gridY = height;
    dim3 dimGrid(gridX, gridY);

    cudaError_t err;
    for (int k = 0; k < numOfTestRuns; k++) {
        start = timeGetTime();
        cuda_kernel<<<dimGrid, block_size>>>(InputData);
        cudaThreadSynchronize();   // launches are asynchronous; wait for the kernel to finish before stopping the clock
        end = timeGetTime();
        err = cudaGetLastError();
        Durations[k] = (end - start);
    }

I am varying the size of InputData and testing how the kernel performs with various block sizes. I have noticed, however, that the kernel sometimes does not get launched - I get a cudaLaunchError. I don’t know if it’s random or whether there is a pattern, but every now and then I get that error for a reason that isn’t obvious to me at the moment. This happens often for a block size of 208. I find it strange because, for example, with a block size of 208, the kernel may run successfully 21,000 times in the loop (k = 21,000) and then fail to launch on the 21,001st iteration. If I rerun it with the same configuration, it may get a cudaLaunchError the next time at k = 15,000 or so. Isn’t this behaviour very strange - that it would successfully execute so many times, without any change in its configuration, but fail to launch at random iterations? What could possibly explain that?
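As a side note, the grid-sizing line in the snippet is just a ceiling division. A minimal host-side sketch of the same arithmetic (plain C++; the function name blocks_for is mine, not from the original code):

```cpp
// Number of blocks needed to cover `width` elements with
// `block_size` threads per block; same arithmetic as in the snippet above.
int blocks_for(int width, int block_size) {
    return width / block_size + (width % block_size == 0 ? 0 : 1);
}
```

Note that whenever width % block_size != 0, the last block contains threads whose x index falls past width, so the kernel itself must guard against those indices.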


Maybe there is some rare race condition in your kernel, or a memory leak in your host code. I’d try to think about what state your program is in at the 21,000th iteration and what changes in that iteration make the kernel fail. After the kernel has crashed once it may leave the device in a dirty state, which can cause subsequent crashes. To be safe you’d have to reboot.

The two possibilities that spring to mind are some sort of resource exhaustion or corruption inside the loop, or hardware faults. You can check for the former with something like valgrind or gpuocelot, and the latter by eliminating the loop inside your program and running the whole program many times with a shell script.
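The "run the whole program many times" test is easy to script. A minimal sketch (POSIX sh; the binary name ./my_cuda_app at the bottom is a placeholder for your actual program):

```shell
# Run a command repeatedly and report how many runs failed.
# Failures spread across separate runs point at hardware flakiness;
# failures only deep inside one run point at in-loop corruption.
run_many() {
    n=$1; shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# Usage (placeholder binary name):
#   run_many 100 ./my_cuda_app
```

This relies on the program returning a nonzero exit code when a launch fails, so make sure the host code propagates the cudaGetLastError result to the exit status.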

I had a GPU that would run identical code a number of times with perfect results and then randomly fail to run kernels, which I eventually convinced myself was due to something overheating on the board.

Thanks jjp and avidday for your suggestions.

I’ll try looking into this with your suggestions some time next week and get back to this thread if I find out why, or at least discover other curious behaviour.

Thanks, and have a nice weekend.

Things to check for first:

  1. overheating GPU - what is the core temp on your GPU?
  2. out of bounds memory writes in your kernel - one way to check is to compile in emulation mode and run through valgrind on Linux.
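On point 2: with a grid rounded up the way the original snippet computes it, the classic source of out-of-bounds writes is the last block in each row, whose spare threads index past width. A sketch of the usual guard (the kernel body and the width parameter are illustrative assumptions, not the poster’s actual code):

```cuda
// Illustrative kernel: one row per blockIdx.y, width elements per row.
// The bounds guard keeps the spare threads of the last (rounded-up)
// block from writing past the end of the row.
__global__ void cuda_kernel(float *data, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x < width)                      // guard for the rounded-up grid
        data[y * width + x] *= 2.0f;
}
```

If the real kernel lacks such a guard, running the emulation-mode build under valgrind, as suggested above, should flag the stray writes.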

What GPU are you running this on? Some kernels just inexplicably have this behavior on older hardware. You’ll find in that thread that the problem seems to go away on GTX 285 and Tesla C1060/S1070/M1060.