Watchdog Timer: What exactly is the watchdog timer?

Hi, all –

Executing a kernel a while back, I ran into the following error message:

Cuda Error: The launch timed out and was terminated

Some digging around on the forums indicates this is caused by a Windows watchdog timer that is supposed to terminate kernel calls if they run for more than 5 seconds.

So I wrote a small test program that runs a simple kernel doing some math for a longer and longer period of time until CUDA returns the error (see project attached). Surprisingly, I can get between 8 and 16 seconds of run time before Windows complains. I am running Windows XP Pro SP3 with a GeForce 8600 GTS and plenty of RAM.
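For readers without the attachment, here is a minimal sketch of that kind of test. It is not the attached project: the kernel `spin`, the launch dimensions, and the 60-second cutoff are all made up for illustration, and it uses the current runtime call `cudaDeviceSynchronize` (toolkits of the CUDA 2.0 era used `cudaThreadSynchronize` instead). Because the work doubles each round, it only brackets the limit rather than measuring it exactly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel: each thread runs `iters` dependent multiply-adds.
__global__ void spin(float *out, long long iters)
{
    float x = threadIdx.x * 0.001f;
    for (long long i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;
    // Store the result so the compiler cannot discard the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int blocks = 64, threads = 128;   // arbitrary launch shape
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Double the work each round until a launch fails or runs past 60 s.
    for (long long iters = 1LL << 20; iters <= (1LL << 40); iters *= 2) {
        cudaEventRecord(start);
        spin<<<blocks, threads>>>(d_out, iters);
        cudaEventRecord(stop);

        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            // A watchdog kill is reported as cudaErrorLaunchTimeout.
            std::printf("iters=%lld failed: %s\n", iters, cudaGetErrorString(err));
            break;
        }

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("iters=%lld ran for %.1f ms\n", iters, ms);
        if (ms > 60000.0f) {
            std::printf("no timeout within 60 s, giving up\n");
            break;
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

The last successful time and the first failing round give a lower and upper bound on when the watchdog fired. After a timeout the context may no longer be usable, so the sketch stops at the first error.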

So: My questions are as follows –

  1. Why does Windows have a watchdog at all?

  2. Why can I run so much longer than everyone says I should be able to?

  3. How can I get around this execution time limit?

I have heard that if I get another video card to run the Windows desktop on, I can have unlimited access to the CUDA device; however, that isn’t an option for me right now.

My Thanks,

Ben Weiss

Oregon State University Graphics Group

PS: The attachment was created on top of a CUDA sample program, so it is designed to be built in the CUDA SDK directory hierarchy.
WatchdogTest.zip (90.7 KB)

I have a similar problem, and none of the NVIDIA guys has explained the reason.

In my case, the test CUDA device is not connected to a monitor, so no watchdog should be active for it at all. However, every kernel that runs for longer than 10-12 seconds terminates with the error described above.

When I was using CUDA 1.1, this error did not appear; on the other hand, a kernel that ran into an infinite loop could hang the whole system. With CUDA 2.0 Beta2 the situation has changed: a kernel that runs for too long (an infinite kernel) is terminated externally by the driver, so the system keeps running. However, the same mechanism also terminates a kernel that does its job correctly but runs for more than 10-12 seconds.

I’ve asked whether it is possible to turn this forced termination off, but I still have no answer.

My understanding:

  1. The card won’t perform a context switch back to rendering until the CUDA kernel is completed. Windows requests the card to context switch, the card doesn’t respond because it’s waiting for the kernel to complete, and then Windows assumes the card or the application using the card has hung and kills whatever is currently using the card, which is your kernel. It’s a stability thing for display adapters, and most of the time it would be considered correct behavior; it’s just not what people want for CUDA.

  2. No idea; I think five seconds is a conservative estimate (so if you’re under five seconds you’re safe). I’ve seen 5.5 seconds on my own projects before, so I don’t know what to tell you here. Windows works in mysterious ways.

  3. Add a second card or use Linux, preferably from the console. As far as I know, this is a Windows thing that simply cannot be circumvented for a display card.
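If neither of those is possible, the usual way to live with the limit (it does not disable the watchdog) is to split the work into many short launches, so that no single launch runs long enough to be killed. A minimal sketch, assuming the work can be cut into independent steps whose intermediate state lives in device memory; the kernel `advance`, the helper `stepsThisLaunch`, and the values of `totalSteps` and `stepsPerLaunch` are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: each thread advances its element by `steps` iterations.
__global__ void advance(float *state, int n, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = state[i];
    for (int s = 0; s < steps; ++s)
        x = 0.5f * x + 1.0f;
    state[i] = x;   // persist so the next launch resumes from here
}

// Host helper: how many steps the next launch should do.
int stepsThisLaunch(long long done, long long total, int perLaunch)
{
    long long remaining = total - done;
    return remaining < perLaunch ? static_cast<int>(remaining) : perLaunch;
}

int main()
{
    const int n = 1 << 16;
    const long long totalSteps = 1000000;  // assumed total amount of work
    const int stepsPerLaunch = 10000;      // assumed; tune so one launch stays well under the limit

    float *d_state = nullptr;
    if (cudaMalloc(&d_state, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_state, 0, n * sizeof(float));

    for (long long done = 0; done < totalSteps; done += stepsPerLaunch) {
        int steps = stepsThisLaunch(done, totalSteps, stepsPerLaunch);
        advance<<<(n + 255) / 256, 256>>>(d_state, n, steps);

        // Wait for this chunk so each launch is timed by the watchdog separately.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
            cudaFree(d_state);
            return 1;
        }
    }

    float first = 0.0f;
    cudaMemcpy(&first, d_state, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("state[0] = %f\n", first);
    cudaFree(d_state);
    return 0;
}
```

The price is extra launch overhead and the need to write intermediate state back to global memory between launches, so this only fits kernels that can be paused and resumed cleanly.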

Romant: is the display extended onto the monitor? Simply disconnecting the card from a monitor doesn’t disable the watchdog timer; the display must not be extended onto that monitor. This is impossible if it’s the only graphics card in the system, sadly.
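One way to check what the driver itself thinks, rather than guessing from the display settings: later CUDA releases expose a `kernelExecTimeoutEnabled` field in `cudaDeviceProp`. As far as I know it is not available in CUDA 2.0, so treat this as a sketch for newer toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        // Nonzero means the driver enforces a run-time limit on kernels for this device.
        std::printf("Device %d (%s): run-time limit on kernels: %s\n",
                    dev, prop.name,
                    prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

If a card that drives no display still reports a run-time limit, the watchdog is being applied to it regardless of what the monitor settings show.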

Thanks!

I have two graphics cards in the system: one does not support CUDA (the monitor is connected to it) and one does (an 8500 GT; nothing is connected to it, and I experiment with CUDA on it).

When I open the display properties I see two monitors: one for the primary non-CUDA card and one for the 8500 GT. The check box ‘extend desktop onto the monitor’ is NOT CHECKED for the 8500 GT, so I believe that it is not used.

Also, the second monitor (for the 8500 GT) is drawn using an etched pattern, which I believe means that it is not active.

I’d like to emphasize that the behaviour has changed with CUDA 2.0: even faulty infinite kernels were not terminated with CUDA 1.1 on the same system with the same cards.