I have a naïve question about something I do not understand very well. Some time ago, I read that on Windows, the graphics card is somewhat restricted by the OS graphics stack (WDDM driver model) unless one has a pro-level card (Tesla, Quadro) switched into a special headless “TCC” mode.
The context of my question is that I am doing big launches (2^30 ≈ 1 billion threads) on a Windows machine without seeing any such issues. I can click on the start menu and launch programs while the kernel is running, with no freezing, crashes, or other unexpected behavior. Is the advice about WDDM drivers outdated? Does OptiX have some internal countermeasures to prevent this failure mode?
The latest version of Mitsuba added a change to keep OptiX launches small on Windows (~2 million threads per launch) based on this old advice, but now I am wondering whether I should revert the change to keep things simple.
The recommendation in that slide deck to check for compute preemption support refers to the device attribute cudaDevAttrComputePreemptionSupported (CUDA runtime API) or CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED (CUDA driver API).
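For illustration, a minimal sketch of that check with the CUDA runtime API could look like the following. It assumes the CUDA toolkit is installed and the program is linked against the CUDA runtime. The two extra attributes (cudaDevAttrTccDriver and cudaDevAttrKernelExecTimeout) are not mentioned in the slide deck; they are only printed here as additional context and are my own addition.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        // 1 if the device supports compute preemption, 0 otherwise.
        int preemption = 0;
        err = cudaDeviceGetAttribute(&preemption, cudaDevAttrComputePreemptionSupported, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }

        // Extra context (assumption: useful to know, not part of the original advice):
        // whether the device runs the TCC driver and whether a run time limit
        // for kernels is active on it. Return codes are ignored for brevity.
        int tcc = 0;
        int execTimeout = 0;
        cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev);
        cudaDeviceGetAttribute(&execTimeout, cudaDevAttrKernelExecTimeout, dev);

        std::printf("device %d: compute preemption=%d, TCC driver=%d, kernel exec timeout=%d\n",
                    dev, preemption, tcc, execTimeout);
    }
    return 0;
}
```

The attribute only tells you that the hardware and driver can preempt compute work. It does not tell you if or when the OS will actually do so.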
Excellent, this clears up many questions. Thanks @droettger!
There is one comment in the PDF that I don’t understand:
Just because you can doesn’t mean you should run kernels for an extended period
Preemption on WDDM comes with some internal scheduling policies that makes it hard to purposely take advantage of compute preemption. The easiest way is to simply design your application without worrying about TDR.
The headline and body of this item seem in conflict to me. The header suggests it’s STILL a bad idea to run long kernels even if preemption is supported. The bottom says not to worry about it. Am I missing something?
I interpret that as a recommendation to run shorter kernels that do not need preemption in the first place, since preemption most likely adds OS overhead.
It just got easier to design applications without TDR in mind when running on sufficiently new OS and GPU configurations (Windows 10 RS4 or newer in WDDM2 mode and Pascal or newer GPU architectures). You cannot control if and when preemption is done by the OS though.
If you need to support older architectures (e.g. Maxwell), you'd still need to design the workload so that each launch stays below the 2-second TDR timeout.
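As a rough sketch of what "designing the workload" can mean on the host side, one large logical launch can be split into several smaller ones. The helper name launchInChunks and the callback signature are hypothetical; this is not OptiX or Mitsuba code. The callback stands in for whatever issues the actual launch (for example an optixLaunch call with an offset stored in the launch parameters). A suitable chunk size is also an assumption and has to be found by measuring on the slowest GPU you need to support.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: splits totalThreads into consecutive launches of at most
// maxPerLaunch threads each and calls launch(offset, count) for every chunk.
// Returns the number of launches that were issued.
std::uint64_t launchInChunks(
    std::uint64_t totalThreads,
    std::uint64_t maxPerLaunch,
    const std::function<void(std::uint64_t, std::uint64_t)>& launch)
{
    assert(maxPerLaunch > 0 && "chunk size must be positive");

    std::uint64_t offset = 0;
    std::uint64_t remaining = totalThreads;
    std::uint64_t launches = 0;

    while (remaining > 0) {
        // The last chunk may be smaller than maxPerLaunch.
        const std::uint64_t count = std::min(maxPerLaunch, remaining);
        launch(offset, count);
        offset += count;
        remaining -= count;
        ++launches;
    }
    return launches;
}
```

With the numbers from this thread, 2^30 threads split into chunks of 2^21 (about 2 million) threads gives 512 launches. Whether each of those stays under the timeout depends entirely on the per-thread cost, so the chunk size alone is no guarantee.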