Big OptiX kernel launches on Windows (WDDM driver model)

Dear OptiX team,

I have a naïve question about something I do not understand very well. Some time ago, I read that on Windows, the graphics card is somewhat restricted by the OS graphics stack (WDDM driver model) unless one has a pro-level card (Tesla, Quadro) switched into a special headless “TCC” mode.

The problem with WDDM was that a long-running kernel would freeze the OS GUI, eventually causing a watchdog timer to expire and restart the graphics driver (thereby killing the kernel). I can’t remember the original article, but I saw it mentioned here, for example: "Display driver stopped responding and has recovered" WDDM Timeout Detection and Recovery

The context of my question is that I am doing big launches (2^30 ≈ 1 billion threads) on a Windows machine without seeing any such issues. I can click on the Start menu and launch programs while the kernel is running – no freezing, crashes, or other unexpected behavior. Is the advice about WDDM drivers outdated? Does OptiX have some internal countermeasures to prevent this failure mode?

The latest version of Mitsuba added a change to keep OptiX launches small on Windows (~2 million threads per launch) based on this old advice (roughly along the lines of the sketch below), but now I am wondering whether I should revert the change to keep things simple.
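
To make this concrete, here is a minimal sketch of that kind of chunking. This is not Mitsuba’s actual code; the Params struct, its base_index field, and launchChunked are hypothetical names, and error checking is omitted for brevity:

```cpp
#include <algorithm>
#include <cuda.h>
#include <cuda_runtime.h>
#include <optix.h>

// Hypothetical launch-parameter struct; only base_index matters here.
struct Params {
    unsigned long long base_index;  // raygen adds this to optixGetLaunchIndex().x
    // ... remaining pipeline parameters ...
};

// Split one big 1D workload into ~2M-thread launches so each kernel stays
// well under the WDDM watchdog timeout. Assumes pipeline, stream, d_params,
// and sbt were set up elsewhere, and that the raygen program offsets its
// global work index by params.base_index.
void launchChunked(OptixPipeline pipeline, CUstream stream,
                   CUdeviceptr d_params, const OptixShaderBindingTable& sbt,
                   Params params, size_t total_threads)
{
    const size_t chunk = size_t(1) << 21;  // ~2 million threads per launch
    for (size_t base = 0; base < total_threads; base += chunk) {
        params.base_index = base;
        // Update the launch parameters on the device for this chunk.
        cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &params,
                        sizeof(Params), cudaMemcpyHostToDevice, stream);
        const unsigned int width =
            static_cast<unsigned int>(std::min(chunk, total_threads - base));
        // 1D launch covering only this chunk of the full workload.
        optixLaunch(pipeline, stream, d_params, sizeof(Params), &sbt,
                    width, /*height=*/1, /*depth=*/1);
    }
}
```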

Thanks,
Wenzel

Hi Wenzel,

It’s a matter of OS and GPU support for preemption methods. The newer the OS and GPU, the better this is supported.

This presentation explains it nicely:
https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9957-using-cuda-on-windows.pdf

The recommendation in that slide deck to check for compute preemption support refers to the device attributes cudaDevAttrComputePreemptionSupported (CUDA runtime API) and CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED (CUDA driver API).
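
For example, querying it through the CUDA runtime API could look like this (the driver API equivalent is cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int supported = 0;
    // Query whether device 0 supports compute preemption.
    cudaError_t err = cudaDeviceGetAttribute(
        &supported, cudaDevAttrComputePreemptionSupported, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("Compute preemption supported: %s\n", supported ? "yes" : "no");
    return 0;
}
```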


Excellent, this clears up many questions. Thanks @droettger!

There is one comment in the PDF that I don’t understand:

Just because you can doesn’t mean you should run kernels for an extended period

Preemption on WDDM comes with some internal scheduling policies that makes it hard to purposely take advantage of compute preemption. The easiest way is to simply design your application without worrying about TDR.

The headline and body of this item seem to be in conflict. The headline suggests it’s STILL a bad idea to run long kernels even if preemption is supported; the body says not to worry about it. Am I missing something?

I interpret that as a recommendation to run shorter kernels that don’t need to be preempted in general, since preemption most likely adds OS overhead.
It has just become easier to design applications without TDR in mind when running on a sufficiently new OS and GPU configuration (Windows 10 RS4 or newer in WDDM2 mode and a Pascal or newer GPU architecture). You cannot control if and when the OS preempts a kernel, though.
If you need to support older architectures (e.g. Maxwell), you’d still need to design the workload to stay below the 2-second timeout.

Thanks, that makes sense!
