CUDA kernel timeout

Hello

I have several kernel functions called one after the other. After I made some changes to one kernel, the CUDA driver fails to launch that kernel. The error message is:
"the launch timed out and was terminated."

CUDA says nothing more.
Does anybody know what, in general, makes CUDA unable to launch a kernel?

Thanks

This comes up if the execution time of the kernel is too long. One possible reason is an infinite loop.
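To confirm that it really is the timeout and not some other failure, it can help to check the error code after both the launch and the synchronization. The following is only a minimal sketch: `suspectKernel`, `isLaunchTimeout` and `launchAndCheck` are hypothetical names standing in for your own code, and the launch configuration is an arbitrary assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that was modified.
__global__ void suspectKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Prints any error and reports whether it is the launch timeout.
bool isLaunchTimeout(const char *where, cudaError_t err)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
    return err == cudaErrorLaunchTimeout;
}

// Launches the kernel, then checks the launch itself and the execution.
// Returns true only if execution ended with the timeout error.
bool launchAndCheck(float *d_data, int n)
{
    suspectKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    isLaunchTimeout("launch", cudaGetLastError());                 // bad configuration etc.
    return isLaunchTimeout("execution", cudaDeviceSynchronize()); // errors while running
}
```

Calling `launchAndCheck(d_data, n)` with a device pointer from `cudaMalloc` prints whatever error occurred. A message other than the timeout one (an illegal memory access, for example) would point at a bug in the kernel rather than at its run time.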

What is too long? Is there a way to tell the CUDA driver what "too long" is?

Thanks

It’s not the driver, it’s the OS. Search for “watchdog”.
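Whether a run-time limit applies to a given GPU can be queried from the CUDA runtime. A small sketch, assuming the `cudaDevAttrKernelExecTimeout` device attribute is available in your toolkit version; `countWatchdogDevices` is a made-up helper name:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints, for every visible device, whether kernels on it are subject to
// a run-time limit. Returns the number of such devices, or -1 on error.
int countWatchdogDevices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    int limited = 0;
    for (int dev = 0; dev < count; ++dev) {
        int timeoutEnabled = 0;
        if (cudaDeviceGetAttribute(&timeoutEnabled,
                                   cudaDevAttrKernelExecTimeout,
                                   dev) != cudaSuccess)
            return -1;
        printf("device %d: kernel run-time limit %s\n",
               dev, timeoutEnabled ? "enabled" : "disabled");
        if (timeoutEnabled) ++limited;
    }
    return limited;
}
```

If the limit shows as enabled for the card you compute on, long-running kernels on that card are candidates for being terminated.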


I don’t know if you want graphics in your program, but I have the same problem on Linux when the execution time of my kernel exceeds 8-9 seconds.

I don’t need graphics, so I switched to text mode (CTRL+ALT+F1), launched the program from there, and it works.

If I launch the same program with the same parameters from a terminal in X, it stops with the message: "the launch timed out and was terminated."


Yup, the kernel timeout is set by default by the OS.

If your GPU is used for both display and CUDA, then you generally get this message (if your kernel executes for too long).
In Windows, you can change this value in the registry editor (WIN key + R, then type ‘regedit’ and press Enter):
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers → TdrLevel =

However, this is NOT RECOMMENDED!!! So, proceed at your own risk!!! >.<

Another possible solution (though a bit costly) is to buy another GPU and use it in compute-only mode.

REF:

  1. http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
  2. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/cudatoolkit_release_notes_windows.txt

Hello, I tried adding a TdrLevel DWORD value and setting it to 0, but it still times out after about 7 or 8 seconds. Did you try this method, and does it work for you?

Thank you,

From my experience, this is usually not because the kernel takes too long. It may be something wrong in your program, e.g., invalid memory accesses or incorrect thread synchronization.


You may be right in some cases. In my case, if I reduce the number of threads, the kernel executes without trouble, given enough time.
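If less work per launch avoids the timeout, one common workaround is to split the job into several shorter launches, since the limit applies to each launch separately. This is only a sketch under that assumption: `processChunk`, `numChunks` and `processInChunks` are hypothetical names, the per-element work is a placeholder, and `chunk` must be greater than zero.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-element work; stands in for the real kernel body.
__global__ void processChunk(float *data, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) data[offset + i] += 1.0f;
}

// Number of launches needed to cover `total` elements, `chunk` at a time.
size_t numChunks(size_t total, size_t chunk)
{
    return (total + chunk - 1) / chunk;
}

// Processes `total` elements in several short launches instead of one
// long one. Stops at the first error and returns it.
cudaError_t processInChunks(float *d_data, size_t total, size_t chunk)
{
    const unsigned threads = 256;
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t count = (total - offset < chunk) ? (total - offset) : chunk;
        unsigned blocks = (unsigned)((count + threads - 1) / threads);
        processChunk<<<blocks, threads>>>(d_data, offset, count);

        cudaError_t err = cudaGetLastError();   // launch configuration errors
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();          // wait so each launch ends on its own
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

The chunk size has to be found by experiment: small enough that a single launch stays well under the limit, large enough to keep the GPU busy.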

I guess this is because the resources on the GPU are exhausted in your program when there are many threads.

I’m working on Windows Server 2003 with a GTX 260 and somehow have no timeout (the card is used for drawing graphics as well). My kernel calls usually take as long as a couple of minutes, and no error is returned.
Nevertheless, one day after I changed something in the code, my kernel started to “time out”. I don’t remember how I fixed that, but it works well now. So there are probably some cases where insufficient resources cause this to happen.

P.S. I checked my registry and don’t see any TdrLevel in there, though HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\DCI\Timeout is set to 7. Weird.

P.P.S. I wish I ran into this problem sometimes. It is better than resetting the whole machine after realizing that the card is not going to return to normal mode and start drawing the Windows environment again :)

If I kill Xorg on my Ubuntu OS then this “the launch timed out and was terminated.” error disappears.

Right. The kernel watchdog is associated with keeping the GUI happy. If you disable the GUI, you have also disabled the kernel watchdog on Linux. This may also be of interest.