Watchdog timer kills CUDA code

Hi,

I’m attempting to get a CUDA Fortran code running on Windows, which means dealing with the watchdog timer that kills any GPU kernel that runs longer than a set amount of time.

The card I’m using for computation is not device 0 on the machine, and it is not hooked up to any monitors. As such, I’m a bit surprised that the watchdog timer consistently kills my program after a few seconds.

I’ve already broken down the GPU part of the code into the smallest reasonable units of work, so I can’t make any gains there. What are my options?

The card I’m using for computation is not device 0 on the machine, and it is not hooked up to any monitors. As such, I’m a bit surprised that the watchdog timer consistently kills my program after a few seconds.

I’m surprised as well, since the watchdog timer should only kill kernels on devices with an attached monitor. Try running with the environment variable “PGI_ACC_TIME=1” and double-check that the program isn’t accidentally using device 0. You can also set the device number using the environment variable “ACC_DEVICE_NUM=1”.
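
If it helps, here’s a small CUDA Fortran sketch (the program and variable names are mine, not from your code) that selects a device explicitly and prints whether the driver enforces a run-time limit on it. It reads the same kernelExecTimeoutEnabled flag that pgaccelinfo reports:

    program check_device
      use cudafor
      implicit none
      type(cudaDeviceProp) :: prop
      integer :: istat, dev

      ! Select the compute card explicitly (device 1 here, as an example).
      istat = cudaSetDevice(1)

      ! Confirm which device the runtime is actually using.
      istat = cudaGetDevice(dev)
      istat = cudaGetDeviceProperties(prop, dev)

      print *, 'Using device ', dev, ': ', trim(prop%name)
      ! Nonzero means the watchdog run-time limit applies to this device.
      print *, 'kernelExecTimeoutEnabled = ', prop%kernelExecTimeoutEnabled
    end program check_device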

Other than that, you need to start hacking the registry to disable the watchdog timer.

Note that I saw this post from someone with a similar issue. However, no one from NVIDIA has answered it yet.
https://forums.geforce.com/default/topic/531745/two-gpu-39-s-still-getting-windows-watchdog-timer/

  • Mat

Thanks, Mat.

I have a write statement at the top of the program that tells me I’m using GPU #1, not GPU #0, so I know I’ve got the correct one. And pgaccelinfo tells me that both cards have their execution times limited. Looks like it’s registry-editing time for me!

Do you guys have a contact at NVIDIA you could bug about this? It seems like it’s definitely not a problem on PGI’s end, but something a bit deeper.

Do you guys have a contact at NVIDIA you could bug about this? It seems like it’s definitely not a problem on PGI’s end, but something a bit deeper.

Sure, let me ping Mark Harris, who answered the Stack Overflow question.

  • Mat

Here’s the response I received back from my contacts at NVIDIA:


On Windows Vista and later, the watchdog timer applies to all WDDM devices, regardless of whether there is a display attached. Anyone hitting the timeouts has three choices:

(1) Use a TCC-capable board (e.g., a Tesla) and enable TCC mode with nvidia-smi.
(2) Increase the watchdog timeout in the registry (I prefer this over disabling the timeout completely). A timeout of, say, 30-60 seconds is enough to let most valid cases complete but still reset without rebooting in cases of a true hang.
(3) Change the kernels (or rather the batches of kernels, which are a little hard to predict under WDDM) so they always finish inside the default two-second maximum. A sketch of this approach follows below.

If one of these solutions is implemented and the app still hangs or TDRs, then it could be a legitimate deadlock condition in the application code, the compiler-generated code, or the NVIDIA driver, in that order of likelihood.
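
To make option (3) concrete, here’s a rough CUDA Fortran sketch of the chunking idea (the kernel, array, and sizes are invented for illustration): instead of one huge launch, the host loops over slices of the problem and synchronizes after each one, so no single batch of work can run past the watchdog window.

    module chunked
      use cudafor
    contains
      ! Hypothetical kernel: doubles elements offset+1 .. offset+nelems of x.
      attributes(global) subroutine process_chunk(x, offset, nelems)
        real :: x(*)
        integer, value :: offset, nelems
        integer :: i
        i = offset + (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= offset + nelems) x(i) = 2.0 * x(i)   ! stand-in for the real work
      end subroutine process_chunk
    end module chunked

    program chunked_launch
      use cudafor
      use chunked
      implicit none
      integer, parameter :: n = 10000000, chunk = 1000000
      real, device, allocatable :: x_d(:)
      integer :: offset, nelems, blocks, istat

      allocate(x_d(n))
      x_d = 1.0

      ! Launch in slices small enough that each batch finishes
      ! well inside the watchdog window.
      do offset = 0, n - 1, chunk
        nelems = min(chunk, n - offset)
        blocks = (nelems + 255) / 256
        call process_chunk<<<blocks, 256>>>(x_d, offset, nelems)
        istat = cudaDeviceSynchronize()   ! force WDDM to submit this batch now
      end do

      deallocate(x_d)
    end program chunked_launch

The synchronize after each launch is the important part under WDDM, since the driver is otherwise free to queue several launches into one batch and submit them together.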

My best guess is that your device is set to use WDDM (Windows Display Driver Model) instead of TCC (Tesla Compute Cluster) mode. Here’s some documentation I found on how to switch modes: http://http.developer.nvidia.com/ParallelNsight/2.1/Documentation/UserGuide/HTML/Content/Tesla_Compute_Cluster.htm.
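
If yours does turn out to be a TCC-capable board, the switch itself should be a one-liner with nvidia-smi from an elevated prompt, followed by a reboot. Something along these lines, as I understand the flags (-i selects the device, -dm 1 requests the TCC driver model):

    nvidia-smi -i 1 -dm 1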

If you are using a non-Tesla card (such as a GTX or Quadro), then your best option would be to increase the watchdog timeout.
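
For reference, the timeout lives in the TDR registry values that Microsoft documents under the GraphicsDrivers key. Here’s a .reg sketch that raises the timeout to 60 seconds rather than disabling it entirely (add the value if it doesn’t already exist, and reboot afterward):

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
    ; TdrDelay is the watchdog timeout in seconds (the default is 2).
    "TdrDelay"=dword:0000003c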

Hope this helps,
Mat

Thank you for the info, Mat. I’m now sorely tempted to copy-paste your response into the countless variations of this question I encountered on Stack Overflow/NVIDIA’s forums/everywhere else on the internet.

You’ve also just given me a golden opportunity to bug my bosses about getting a Tesla rig set up. Fingers crossed!