cudaErrorLaunchFailure without any apparent occurrence pattern?

I implemented a rather complex kernel that searches for stars in astronomical images. While it works perfectly on small images with a few thousand stars, it generates cudaErrorLaunchFailure (returned by the subsequent cudaMemcpy call) on a 4k x 4k image with ~50k stars. Unfortunately, even after a week of experiments and googling (including this forum) I have not been able to determine the cause.
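For context, this is where the error surfaces (a minimal sketch; the kernel name, launch configuration, and buffer names are placeholders, not my actual code):

```cuda
// Sketch only: kernel name and launch configuration are placeholders.
findStarsKernel<<<grid, block>>>(dImage, dStars, width, height);

// The launch itself returns immediately; the failure is reported by
// the next synchronizing runtime call:
cudaError_t err = cudaMemcpy(hStars, dStars, nBytes, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    fprintf(stderr, "%s\n", cudaGetErrorString(err));  // "unspecified launch failure"
```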

I believe the problem is not caused by unaligned access (all local data are 4-byte ints and floats).

There is a dependence on kernel execution time, but there is no clear threshold. The kernel has to run for more than ~2 s to fail, but sometimes it fails after ~2.1 s, while on a different image it works fine even with an execution time of ~2.8 s. At ~3 s and above it almost always fails.

There is a dependence on image dimensions, but again not a clear one. The full 4k x 4k image always fails. An approx. 3k x 3k crop of ANY portion of the image usually works. Smaller crops always work. When the image size is close to this limit, the kernel usually works several times and then fails.

There is no dependence on image content. Any portion of the 4k x 4k image can be processed; it is enough to crop off any part of the image.

There is no dependence on any single dimension; 4k x 2k and 2k x 4k images always work fine.

I tried to terminate the kernel prematurely by returning from various parts of the code, again without any observable pattern. When the execution time is short, the error is never generated. When the execution time grows because further parts of the algorithm are included, the error starts to occur on larger images. There is no specific portion of code that causes the error; any part of the code works fine on a smaller image.

I use CUDA v8 and the error occurrence depends on the hardware used. While the error always appears when the kernel is launched on the full 4k x 4k image on a GTX 650, the same image is processed successfully on a GTX 1060 in 9 out of 10 tries.

Any hint would be much appreciated.

Have you run the code under cuda-memcheck?
Is your GPU driving a display, which would subject kernels to the watchdog timer?
Windows or Linux?

Thank you for the reply. I believe cuda-memcheck returns no relevant information:
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleLoadDataEx + 0x2a9108) [0x2b098b]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\cudart64_80.dll (cudaMemcpy + 0x12f) [0x2acef]
========= Host Frame:C:\work\sims\scudalib.dll (scudalib::FindStars + 0x639) [0x2689]

I used 2 setups:
i5 3470K + 8GB + GTX650 + Windows 10 64 Pro + CUDA v8
i7 4771 + 16GB + GTX1060 6GB + Windows 10 64 Pro + CUDA v8
GPU drives the display in both cases.

I considered that the watchdog timer could be the problem, but as mentioned, sometimes the kernel fails after 2 s and sometimes it returns OK after spending 3 s inside. Is there any hysteresis in the watchdog?

I set HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to TdrLevelOff (0) and the problem disappeared. Thank you very much for the hint; my assumption that the varying times for which the kernel worked OK ruled out this possibility was obviously wrong.

I don’t know how the watchdog is implemented internally, but in the few cases when I’ve hit it, it seemed to not necessarily kick in exactly at the timeout, potentially giving the kernel an extra second or so. That anecdotal experience is quite dated, however. So I think this would very much match your description of the problem.

Another thing to consider is that CUDA under the Windows WDDM driver batches kernel calls to reduce launch overhead, and the timeout applies to the entire batch, not to a single kernel. A call to cudaStreamQuery() forces immediate launch of the batch for the stream.
Sprinkling it over the code at strategic places ensures no batch goes over the timeout, provided the individual kernels are short enough.
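A sketch of that approach (the kernel name, tiling scheme, and stream setup here are hypothetical, not from the original code):

```cuda
// Sketch: split the work into short launches and flush the WDDM
// batch after each one so no batch exceeds the TDR timeout.
cudaStream_t stream;
cudaStreamCreate(&stream);

for (int tile = 0; tile < nTiles; ++tile) {
    // Each launch must individually finish well under the timeout.
    processTile<<<grid, block, 0, stream>>>(dImage, dStars, tile);

    // Non-blocking: forces immediate submission of the queued batch;
    // returns cudaSuccess or cudaErrorNotReady without waiting.
    cudaStreamQuery(stream);
}
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```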

Note that completely disabling the watchdog means you lose protection in case of program bugs causing long or infinite kernel runtime, so setting a larger timeout value is usually preferred over disabling the watchdog completely.

You are right, keeping TdrLevel at TdrLevelRecover (3) and setting TdrDelay to a much larger value is the better solution. Btw. the page says the default TdrDelay value is 2 s, which corresponds to my experience.
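For reference, those two settings can be captured in a .reg file (the key and value names are per Microsoft's TDR documentation; the 10 s delay is just an example value):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
; TdrLevel 3 = TdrLevelRecover (the default): detect timeouts and recover
"TdrLevel"=dword:00000003
; TdrDelay: seconds before the watchdog fires (default 2); example: 10 s
"TdrDelay"=dword:0000000a
```

A reboot is required for the changes to take effect.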