WDDM timeout while running CUDA on secondary GPU?

https://developer.nvidia.com/cuda-faq#Programming

“The Windows watchdog timer causes programs using the primary graphics adapter to time out if they run longer than ~5 seconds.”

It is recommended to run CUDA on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it.

Q1: Any GPU? Tesla? Quadro? GeForce? Because GeForce cards can’t be set to TCC mode…
Just install a CUDA capable GeForce card as secondary graphics card, without connecting to any display?
Then the GeForce card isn’t subject to the “watchdog” timer?


The system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

Q2: Does this also work with a non-NVIDIA GPU as primary graphics adapter? Like Intel iGPU or AMD iGPU/dGPU?

Q1: You still need to disable TDR if your kernel runs longer than that, even for a secondary GPU that is running in WDDM mode.

Q2: I believe disabling TDR disables it for all graphics cards in the computer, so yes, it also applies to other brands of graphics adapters; it is not an NVIDIA-only parameter.

Thanks!

Q1: Even if the 2nd GPU is not connected to any display?
I’m asking because Teslas and certain Quadros can be set to TCC mode; GeForce cards can’t. But is a GeForce that is disconnected from any display also subject to the WDDM timer?

Q2: NVIDIA says:
For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

So I was thinking: maybe NVIDIA invented something that lets you run CUDA on any NVIDIA GPU (including a GeForce that is disconnected from any display) for as long as it takes, if and only if the primary graphics adapter (which IS connected to a display) is an NVIDIA GPU too.

Teslas are way too expensive for me :-) So it would be very nice if I could just insert a GeForce in my 2nd PCIe x16 slot to get rid of the time limit (next to a Quadro in the 1st PCIe x16 slot).

“Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.”

Can someone confirm this?

Source: http://irayrender.com/fileadmin/filemount/editor/PDF/iray_Performance_Tips_100511.pdf

You don’t need to buy a different GPU. Just disable the watchdog timer if you’re running kernels longer than the timeout, which I believe defaults to 2 seconds. It’s simple and just requires a reboot.

Instructions here:
https://devtalk.nvidia.com/default/topic/535264/cuda-programming-and-performance/kernel-runs-fine-on-osx-linux-crashes-on-windows-/post/3762516/#3762516
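For reference, the registry change usually comes down to two REG_DWORD values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers: TdrDelay (the timeout in seconds, default 2) and TdrLevel (0 turns timeout detection off). Below is a minimal Python sketch that only builds the text of a .reg file and does not touch the registry itself. The helper name make_tdr_reg and the 60-second value are illustrative assumptions, not something taken from the linked post, so check the result against Microsoft's TDR registry key documentation before importing it.

```python
# Sketch only: builds the text of a .reg file; it does not modify the registry.
GRAPHICS_DRIVERS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def make_tdr_reg(tdr_delay_seconds=None, disable_tdr=False):
    """Return .reg file text that sets TdrDelay and/or turns TDR off.

    make_tdr_reg is a hypothetical helper name used for illustration.
    tdr_delay_seconds must be a positive integer when given.
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{GRAPHICS_DRIVERS_KEY}]"]
    if tdr_delay_seconds is not None:
        if tdr_delay_seconds <= 0:
            raise ValueError("TdrDelay must be a positive number of seconds")
        # REG_DWORD values are written as 8 hex digits in .reg files.
        lines.append(f'"TdrDelay"=dword:{tdr_delay_seconds:08x}')
    if disable_tdr:
        # TdrLevel 0 means timeout detection is off.
        lines.append('"TdrLevel"=dword:00000000')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Example: raise the timeout to 60 seconds instead of disabling TDR.
    print(make_tdr_reg(tdr_delay_seconds=60))
```

Raising TdrDelay instead of setting TdrLevel to 0 keeps the watchdog as a safety net while still giving long kernels room to finish. Either way, a reboot is needed before the change takes effect.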

What GPU2 is useful for (if it’s driving your graphics) is that you can use GPU1 for CUDA and still use graphical programs without the screen stalling on redraws or becoming unresponsive while kernels run non-stop on GPU1.

Thanks! Yes, I think I am going to disable the watchdog timer. But just let me get things straight…

If I install 2 GeForce cards, GPU1 as primary display adapter and GPU2 disconnected from any display, then GPU2 is still subject to the watchdog timer and I still need to change the TDR settings?

By doing so, I will disable the watchdog timer for both GPU1 and GPU2?

I am asking, because NVIDIA says explicitly:

"On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually will cause a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases can hang the entire machine, requiring a hard reset.

This is caused by the Windows “watchdog” timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.

For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter."

https://developer.nvidia.com/cuda-faq#Programming

NVIDIA does not advise disabling the watchdog timer. They say: just install 2 NVIDIA GPUs, 1 for display, 1 for compute, problem solved. But is this true?

Because some guy at Puget Systems tried a 980 Ti + Titan X in WDDM mode and still ran into the time limit.

“I wondered if it might only happen if the card that was becoming unresponsive was the primary one, driving the actual GUI / display. So I put both GeForce cards in (980 Ti and Titan X) and ran the benchmark test on just the secondary card… but it still tripped TDR.”

Did he make a mistake?

https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/

Like I said earlier… the benefit of having a (second) GPU to drive your displays is that you will not have screen refresh / hang issues if you are running kernels non-stop on the other GPU.

The benefit of disabling the watchdog timer is that… *gasp* the driver does not kill your running CUDA process after the default timeout delay of 2 or 5 seconds.

Don’t overanalyze statements :p

As for the Puget Systems link, no idea. I KNOW for a fact that disabling TDR does work, as long as it’s done correctly and a reboot is done. I’ve done it many times, on different systems, with different cards. That particular case must be user error.

The reason NVIDIA doesn’t mention disabling it is that under some unusual conditions (hardware failure, driver issues, etc.) the timeout is actually useful for returning the machine to a normal state rather than leaving it hung/stalled.
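If you would rather keep the watchdog enabled for that reason, note that the limit quoted above applies to each individual launch, not to the whole program. One common approach is to split a long job into several shorter launches. The following Python sketch only does the arithmetic for that split; plan_launches, the measured throughput and the 1-second budget are hypothetical values for illustration, and the actual kernel launches are left out.

```python
import math


def plan_launches(total_items, items_per_second, time_budget_seconds=1.0):
    """Split total_items into (offset, count) chunks, one per kernel launch.

    items_per_second is a throughput you measured yourself on a short run.
    time_budget_seconds should stay well below the watchdog timeout.
    All names here are hypothetical.
    """
    if total_items < 0 or items_per_second <= 0 or time_budget_seconds <= 0:
        raise ValueError("need total_items >= 0 and a positive rate and budget")
    # Largest chunk expected to finish within the budget (at least 1 item).
    chunk = max(1, math.floor(items_per_second * time_budget_seconds))
    return [(start, min(chunk, total_items - start))
            for start in range(0, total_items, chunk)]


if __name__ == "__main__":
    # 10 items at a measured 4 items/second with a 1-second budget.
    for offset, count in plan_launches(10, 4, 1.0):
        print(f"launch kernel on items [{offset}, {offset + count})")
```

Each (offset, count) pair would then be passed to one kernel launch, so no single launch is expected to run longer than the chosen budget.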

I’m honestly unsure whether, without disabling TDR, a card that is not connected to any display is exempt from the timer, because I just disable it altogether. If someone else can chime in with the answer to that, feel free.

Thanks!

When you use 2 GPUs (GPU1 for display, GPU2 for compute) and you disable TDR (i.e. watchdog timer) then does that setting apply to both GPUs?

Or is it possible to keep TDR enabled for GPU1 (driving the display) and disable TDR for GPU2 (running the CUDA code)?

The setting applies to all GPUs in the system.
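Because the setting is machine-wide, one way to see what is currently configured is to read the TdrLevel and TdrDelay values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. Here is a minimal sketch using Python's standard winreg module. It assumes those two value names, read_tdr_settings is a made-up helper name, and on anything other than Windows it simply returns None.

```python
import sys

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_settings():
    """Return {'TdrLevel': ..., 'TdrDelay': ...}, or None when not on Windows.

    A None inside the dict means the registry value is absent,
    i.e. Windows is using its built-in default.
    """
    if sys.platform != "win32":
        return None
    import winreg  # standard library, Windows only

    settings = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        for name in ("TdrLevel", "TdrDelay"):
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # value not set; the default applies
            settings[name] = value
    return settings


if __name__ == "__main__":
    print(read_tdr_settings())
```

There is only one set of these values for the whole machine, which matches the answer above: there is no per-GPU switch in this key.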