cudaErrorLaunchFailure without any apparent occurrence pattern?

I implemented a rather complex kernel that searches for stars in astronomical images. While it works perfectly on small images with a few thousand stars, it generates cudaErrorLaunchFailure (returned by the subsequent cudaMemcpy call) on a 4k x 4k image with ~50k stars. Unfortunately, even after a week of experiments and googling (including this forum), I have not been able to determine the cause.
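The host-side structure is roughly as follows (a minimal self-contained sketch; findStarsKernel and its thresholding body are illustrative stand-ins, not my actual code), which is why the error only shows up in the return value of cudaMemcpy:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the star-search kernel; not the real code.
    __global__ void findStarsKernel(const float *image, int *starCount,
                                    int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height && image[y * width + x] > 0.5f)
            atomicAdd(starCount, 1);
    }

    int main()
    {
        const int width = 4096, height = 4096;
        float *dImage = NULL;
        int *dCount = NULL, hCount = 0;
        cudaMalloc(&dImage, (size_t)width * height * sizeof(float));
        cudaMalloc(&dCount, sizeof(int));
        cudaMemset(dImage, 0, (size_t)width * height * sizeof(float));
        cudaMemset(dCount, 0, sizeof(int));

        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        findStarsKernel<<<grid, block>>>(dImage, dCount, width, height);

        // This only catches launch-configuration errors; the kernel itself
        // runs asynchronously.
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

        // An error during kernel execution surfaces at the next
        // synchronizing call -- this cudaMemcpy then returns
        // cudaErrorLaunchFailure.
        cudaError_t err = cudaMemcpy(&hCount, dCount, sizeof(int),
                                     cudaMemcpyDeviceToHost);
        printf("memcpy: %s\n", cudaGetErrorString(err));

        cudaFree(dImage);
        cudaFree(dCount);
        return 0;
    }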

I believe the problem is not caused by unaligned access (all local data are 4-byte ints and floats).

There is a dependence on kernel execution time, but no clear threshold. The kernel has to run for more than ~2 s to fail, but sometimes it fails after ~2.1 s, while on a different image it works fine even with an execution time of ~2.8 s. With an execution time of ~3 s or more it almost always fails.

There is a dependence on image dimensions, but again not a clear one. The full 4k x 4k image always fails. An approx. 3k x 3k crop of ANY portion of the image usually works. Smaller crops always work. When the image size is close to the failing size, the kernel usually works several times in a row and then fails.

There is no dependence on image content. Any portion of the 4k x 4k image can be processed; cropping out any part of the image is enough.

There is no dependence on any single dimension; 4k x 2k and 2k x 4k images always work.

I tried to terminate the kernel prematurely by returning from various parts of the code, again without any observable pattern. When the execution time is short, the error is never generated. When the execution time grows because other parts of the algorithm are included, the error starts to occur on larger images. There is no specific portion of the code that causes the error; every part of the code works fine on smaller images.

I use CUDA v8 and the error occurrence depends on the hardware used. While the error always appears when the kernel is launched on the full 4k x 4k image on a GTX 650, the same image is processed fine on a GTX 1060 about 9 times out of 10.

Any hint would be greatly appreciated.

Have you run the code under cuda-memcheck?
Is your GPU driving a display, which would subject kernels to the watchdog timer?
Windows or Linux?

Thank you for the reply. I believe cuda-memcheck returns no relevant information:
========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleLoadDataEx + 0x2a9108) [0x2b098b]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\cudart64_80.dll (cudaMemcpy + 0x12f) [0x2acef]
========= Host Frame:C:\work\sims\scudalib.dll (scudalib::FindStars + 0x639) [0x2689]

I used 2 setups:
i5 3470K + 8 GB + GTX 650 + Windows 10 Pro 64-bit + CUDA v8
i7 4771 + 16 GB + GTX 1060 6 GB + Windows 10 Pro 64-bit + CUDA v8
GPU drives the display in both cases.

I considered that the watchdog timer could be the problem, but as mentioned, sometimes the kernel fails after ~2 s and sometimes it returns fine after ~3 s spent inside. Is there any hysteresis in the watchdog?

I set HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to TdrLevelOff (0) and the problem disappeared. Thank you very much for the hint; my assumption that the varying times for which the kernel still worked ruled out this possibility was obviously wrong.

I don’t know how the watchdog is implemented internally, but in the few cases where I have hit it, it seemed not to kick in exactly at the timeout, potentially giving the kernel an extra second or so. That anecdotal experience is quite dated, however. So I think this would very much match your description of the problem.

Another thing to consider is that CUDA under the Windows WDDM driver batches kernel launches to reduce launch overhead, and the timeout applies to the entire batch, not to a single kernel. A call to cudaStreamQuery() forces immediate submission of the batch for that stream.
Sprinkling it over the code at strategic places ensures that no batch exceeds the timeout, provided the individual kernels are short enough.
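A sketch of what that could look like (processTile and processImage are hypothetical names; tiling is just one way to keep individual kernels short):

    #include <cuda_runtime.h>

    // Illustrative per-tile kernel; the name and the work it does are
    // placeholders, not an actual API.
    __global__ void processTile(float *data, int tileOffset, int tileSize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < tileSize)
            data[tileOffset + i] *= 2.0f;   // placeholder work
    }

    // Split the image into tiles so each kernel stays well under the
    // timeout, and flush the WDDM launch batch after every launch.
    void processImage(float *dData, int nTiles, int tileSize,
                      cudaStream_t stream)
    {
        for (int t = 0; t < nTiles; ++t) {
            int blocks = (tileSize + 255) / 256;
            processTile<<<blocks, 256, 0, stream>>>(dData, t * tileSize,
                                                    tileSize);
            // Forces immediate submission of the batched launches for this
            // stream without blocking the host.
            cudaStreamQuery(stream);
        }
    }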

Note that completely disabling the watchdog means you lose protection against program bugs that cause overly long or infinite kernel runtimes, so setting a larger timeout value is usually preferable to disabling the watchdog altogether.

You are right, keeping TdrLevel at TdrLevelRecover (3) and setting TdrDelay to a much larger value is the better solution. Btw. the https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx page says the default TdrDelay value is 2 s, which corresponds to my experience.
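For anyone finding this later, the change expressed as a .reg file (the 10 s TdrDelay is just an example value I picked; a reboot is needed for the change to take effect):

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
    ; TdrLevelRecover: detect timeouts and recover (the default behaviour)
    "TdrLevel"=dword:00000003
    ; Timeout in seconds; 0xa = 10 s instead of the default 2 s
    "TdrDelay"=dword:0000000a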