In Windows 10, I was able to perform concurrent Host-to-Device and Device-to-Host transfers using cudaMemcpyAsync(), alongside kernel execution — all running in separate streams. After upgrading to Windows 11, I noticed that kernel execution still overlaps correctly with memory transfers, but Host-to-Device and Device-to-Host transfers no longer run concurrently as they did before.
Test Procedure:
Allocate pinned host memory using cudaHostAlloc()
Allocate device memory using cudaMalloc()
Initiate asynchronous memory copies on separate streams:
cudaMemcpyAsync(d_in, h_in, TEST_SIZE, cudaMemcpyHostToDevice, stream1)
cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream2)
Launch a kernel on stream3
Synchronize using cudaDeviceSynchronize()
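For reference, the test procedure above can be sketched roughly like this (a minimal sketch; TEST_SIZE, the kernel, and all names are placeholders, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

#define TEST_SIZE (64 * 1024 * 1024)   // illustrative transfer size

__global__ void busyKernel(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 2.0f + 1.0f;
}

int main()
{
    float *h_in, *h_out, *d_in, *d_out, *d_work;
    cudaStream_t stream1, stream2, stream3;
    int n = TEST_SIZE / sizeof(float);

    // Pinned host memory is required for truly asynchronous copies
    cudaHostAlloc(&h_in,  TEST_SIZE, cudaHostAllocDefault);
    cudaHostAlloc(&h_out, TEST_SIZE, cudaHostAllocDefault);
    cudaMalloc(&d_in,  TEST_SIZE);
    cudaMalloc(&d_out, TEST_SIZE);
    cudaMalloc(&d_work, TEST_SIZE);

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);

    // H2D and D2H copies on separate streams, kernel on a third
    cudaMemcpyAsync(d_in, h_in, TEST_SIZE, cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream2);
    busyKernel<<<(n + 255) / 256, 256, 0, stream3>>>(d_work, n);

    cudaDeviceSynchronize();
    return 0;
}
```

Full three-way overlap additionally requires a GPU with at least two copy engines; on such hardware this pattern overlaps fully on Windows 10 under the test conditions described.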
Results:
Figure 1 (Windows 10): All operations overlap as expected — full concurrency observed.
Is anyone aware of the cause behind this behavior change in Windows 11? Could this be a driver or WDDM scheduling limitation introduced in the newer OS?
I’ve noticed a general decline in behavior reported on Windows as well. About 2-5 years ago, when Microsoft introduced the option/switch for hardware-accelerated GPU scheduling (HAGS), things actually seemed to be getting better. I had good success telling people to try the switch to see if it improved concurrency scenarios, and in several cases I noted, it did. Those reports are on this forum; I don’t wish to look for them at the moment (they should be easy to find with a site: search and “hardware accelerated GPU scheduling” keywords).
However I agree/have noticed that “lately”, in the last year or so, I have made similar suggestions and the response has been “no change/no improvement”.
The issuing of work in a WDDM environment is at least partly controlled by the Microsoft portion of the GPU driver. For a time there were “hacky” methods in CUDA (such as recording an event and then doing an event query) to try to move already-issued work forward to the GPU. Then the Windows HAGS came along. Now perhaps something else has changed.
I’ll still make the same suggestions: 1. Use the TCC driver model if possible (generally not possible on consumer GPUs). 2. Try both settings of Windows HAGS to see if either is preferable. 3. Try (if you wish) sprinkling event-recording steps into your work delivery loop (perhaps at, or in place of, the cudaDeviceSynchronize()), along with event-query steps, to see if it makes a difference.
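Suggestion #3 might look something like the following (a sketch only; `issueWork`, the event name, and the loop shape are illustrative assumptions, not a known fix):

```cuda
// Record an event after issuing each batch of work, then query it.
// On WDDM, the non-blocking query can prompt the driver to submit
// already-queued work to the GPU sooner.
cudaEvent_t flushEvent;
cudaEventCreateWithFlags(&flushEvent, cudaEventDisableTiming);

for (int batch = 0; batch < numBatches; ++batch) {
    issueWork(stream);                     // your cudaMemcpyAsync / kernel calls
    cudaEventRecord(flushEvent, stream);   // mark the end of this batch
    cudaEventQuery(flushEvent);            // non-blocking; may flush the queue
}
cudaDeviceSynchronize();                   // final sync as before
cudaEventDestroy(flushEvent);
```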
You can also file a bug. But it will be important to provide a well-documented reproducer code and instructions and env description, in that case.
Differences between Win10 and Win11 are certainly plausible, because WDDM remains under active development. Versions 2.x were used for Win10, and versions 3.x for Win11, with WDDM 3.2 from September 2024 being the latest.
I would therefore imagine that in as far as NVIDIA’s CUDA driver contains performance-boosting workarounds for WDDM, the NVIDIA driver engineers are playing a constant game of catch-up to a moving target. I also would not be surprised if some of these workarounds are based on observed rather than officially documented behavior, meaning they are brittle by nature.
From what I can observe, the TCC driver on Windows continues to offer behavior substantially similar to that seen on Linux, so professional applications should target either of those to minimize frustration.
The worst situation is for companies wanting to offer CUDA software running on customers’ Windows PCs or even notebooks. Their options:
accept bad WDDM performance
demand customers should be using Linux
deliver a hardware box (with an embedded PC and Nvidia GPU running on Linux) instead of software
demand customers install a second professional graphics card and install the TCC driver
try to circumvent WDDM shortfalls by using zero-copy memory or by kernel fusion
demand customers run Windows 7 or Windows 10 and disconnect from the internet or add Microsoft domains to the firewall blacklist to prevent automatic system updates
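The zero-copy workaround mentioned above maps pinned host memory into the device address space so kernels access it directly over PCIe, sidestepping explicit copy calls entirely. A minimal sketch (names and sizes are illustrative):

```cuda
// Allocate pinned host memory that is mapped into the device address space.
float *h_buf, *d_buf;
cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_buf, h_buf, 0);

// The kernel reads/writes host memory directly; no cudaMemcpyAsync needed.
myKernel<<<grid, block>>>(d_buf, N);
cudaDeviceSynchronize();   // results are visible in h_buf after sync
```

The trade-off is that every device-side access crosses the PCIe bus, so this only pays off for data that is touched once or streamed.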
NVIDIA could make the GPUs register as two devices: one for graphics and one for computation.
Regarding TCC mode — I also prefer to use TCC whenever possible. On my desktop with a single display card (RTX 6000), I’m able to switch to TCC mode using nvidia-smi, and everything works smoothly.
However, on my target deployment system, the setup is a bit different. The machine uses an integrated display, and the RTX 6000 is connected externally via Thunderbolt, acting as a remote GPU. When I switch the RTX 6000 to TCC mode and reboot, the driver fails to load. In Device Manager, the GPU shows a “!” symbol with the error message: “Insufficient system resources exist to complete the API.”
I’m unsure if this issue is related to the Thunderbolt connection, PCIe resource allocation, or something else entirely.
Has anyone encountered a similar issue or have any insights on whether TCC mode is supported or stable over Thunderbolt connections?
Is this kind of configuration officially supported by NVIDIA? I did not even know Thunderbolt was a thing outside of the Apple universe, and Apple kicked NVIDIA to the curb years ago.
Why even go down that route? To my knowledge, 90+% of professional-grade HPC takes place on Linux platforms; Windows is a niche. Conversely, Microsoft probably does not give a rodent’s behind about CUDA. Just as contractors use professional-grade tools, not DIY / hobbyist ones, why use an OS platform originally designed for office applications and later extended to gaming for professional compute applications?
I believe the dock includes a Thunderbolt-to-PCIe bridge, which is supported by the NVIDIA driver—Windows is able to detect and utilize the GPU properly.
When I switch the RTX 6000 to TCC mode and reboot, the driver fails to load
This suggests either (1) a bug in NVIDIA’s driver stack or (2) unsupported platform to me. If you are convinced it’s (1), you could always file a bug with NVIDIA.
Hi njuffa, I’m not confident enough to conclude that it’s an issue with NVIDIA’s driver stack. I’ve been searching around but haven’t found any information that clearly matches my use case so far.
Thanks again for your help.
Thanks for the suggestions. I’ve tried options #2 and #3, but unfortunately, they didn’t resolve the issue.
As for #1, the PC fails to load the driver after switching to TCC mode.
They kind of combined it with USB into one standard.
First, Thunderbolt was made royalty-free, opening it up to non-Apple platforms; then Thunderbolt switched its connector to USB-C, and in return the USB forum adopted the Thunderbolt 3 protocol as the basis of the next, faster official USB standard (USB4). Before that (USB 3), you could have either plain USB-C ports or Thunderbolt-capable USB-C ports. Thunderbolt allows routing display output and/or PCIe lanes over external cables, all within the same protocol.