Timeout detection and recovery

Hi,

I have a CUDA kernel, a modified version of OpenCV 3 StereoBM (https://docs.opencv.org/3.4/d9/dba/classcv_1_1StereoBM.html). It takes two large images (48 megapixels) as input.

When it runs on a Windows 10 1607 machine with an NVIDIA GTX 1050 Ti, the launch times out and is terminated. The launch appears to be terminated once the code has run for more than 2 seconds. The GUI does not freeze during execution.

However, there is no termination on a Windows 10 1809 machine with an NVIDIA GT 1030, where the code may run for as long as 11 seconds. The GUI does not freeze there either.

The first machine has the TdrDebugMode registry key set to 3. The second machine has the same key with the same value, plus TdrDelay set to 2.

The question is: why is the timeout detected only on the first machine and not on the second?

I also ran another piece of code: a CUDA kernel that consists of a single infinite loop. Its execution is terminated only on the first machine.
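For anyone who wants to reproduce this, here is a minimal sketch of such a test. It is not the code I actually ran: instead of a true infinite loop it spins for a fixed time (5 seconds, an arbitrary choice longer than the 2-second limit seen above), so it also finishes on a machine where the watchdog never fires. The names spin and seconds_to_cycles are made up for this example, and the spin duration is only approximate because the GPU clock can vary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait on the GPU until roughly `cycles` SM clock ticks have elapsed.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Convert seconds to clock ticks; clock_khz is the value of cudaDevAttrClockRate.
long long seconds_to_cycles(double seconds, int clock_khz) {
    return static_cast<long long>(seconds * 1000.0 * clock_khz);
}

int main() {
    int clock_khz = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }

    const double seconds = 5.0;  // assumption: longer than the TDR delay under test
    spin<<<1, 1>>>(seconds_to_cycles(seconds, clock_khz));

    // The timeout, if any, is reported when the host waits for the kernel.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorLaunchTimeout) {
        printf("kernel was terminated by the watchdog: %s\n", cudaGetErrorString(err));
    } else if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    } else {
        printf("kernel ran to completion (about %.0f s)\n", seconds);
    }
    return 0;
}
```

Build it with nvcc and run it on both machines. Only a result of cudaErrorLaunchTimeout points at the watchdog.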

The entire TDR mechanism is something Microsoft created for its WDDM drivers. Given that, it is best to consult the authoritative documentation provided by the folks who designed it:

https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys

Thank you for your reply, njuffa. There is no answer in the Microsoft documentation. I thought someone here might have had the same issue, or that there might be something special about the GTX 1050 Ti.

Try a controlled experiment. In a controlled experiment, only one variable changes at any time. Here you have different OS versions, different GPUs, different TDR key settings across the two machines. Not a controlled experiment.

The first thing you would want to do is to create the same TDR-relevant keys on both machines and assign each key the same value on both machines. When you do that, does the observed behavior of the two machines match? If you exchange the GPUs between the two machines, how does observed behavior change?

If you can’t figure out what is going on with experiments, I would suggest asking in an appropriate Microsoft forum.

Yes, I used the same TDR keys on both machines, but the behavior did not change. Even setting TdrDelay to 2 on the second machine does not stop the long-running execution. There was also no change in behavior when I installed the GTX 1050 Ti in the second machine.

njuffa, thank you for the suggestions.

It appears that kernel execution is aborted by TDR because the GPU scheduler cannot preempt this particular task: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/timeout-detection-and-recovery#timeout-detection-in-the-windows-display-driver-model-wddm

I use C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\demo_suite\deviceQuery.exe to check if a device supports preemption.

For some reason, the GTX 1050 Ti does not support compute preemption on Windows 10 1607, but it does on Windows 10 1809!

Preemption is also not supported on the Quadro K2200 because of the card's architecture (at least when WDDM mode is on). Compute preemption is supported on Pascal.
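The same check can be done from code rather than by reading the deviceQuery output. The sketch below queries the CUDA runtime attributes cudaDevAttrComputePreemptionSupported, cudaDevAttrKernelExecTimeout and cudaDevAttrTccDriver for every device. The helper yes_no is a made-up name for this example, and I am assuming these attributes report the same information that deviceQuery prints.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Turn a 0/1 attribute value into readable text.
const char* yes_no(int value) { return value ? "yes" : "no"; }

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        int preempt = 0, timeout = 0, tcc = 0;
        // Can a running compute kernel be preempted by the scheduler?
        cudaDeviceGetAttribute(&preempt, cudaDevAttrComputePreemptionSupported, dev);
        // Is there a run time limit on kernels for this device?
        cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev);
        // Is the device using the TCC driver instead of WDDM?
        cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev);

        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  compute preemption supported: %s\n", yes_no(preempt));
        printf("  kernel run time limit:        %s\n", yes_no(timeout));
        printf("  TCC driver:                   %s\n", yes_no(tcc));
    }
    return 0;
}
```

Running this on both machines with the same card makes it easy to see whether the preemption flag differs between the two Windows versions.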

The card needs to be in TCC mode to avoid the WDDM timeout. Even a GeForce card that is not driving a display still has a WDDM driver stack built on it by Windows (after all, you could go into windows control panel and enable a display on that card at any time). This WDDM driver stack may enforce card responsiveness (i.e. the WDDM watchdog) even if the card is “not being used” by windows.
https://devtalk.nvidia.com/default/topic/959472/cuda-setup-and-installation/how-to-know-what-cards-allow-tcc-mode-/post/4975538/#4975538