cudaMemcpy behaving asynchronously with drivers 471.11+


Has anyone noticed cudaMemcpy behaving strangely since they updated their NVIDIA drivers to 471.11?

In two test programs, cudaMemcpy is now asynchronous and the function call returns before any of the copying from the host to the device has actually completed. I can reproduce this behaviour with both CUDA 11.4 SDK in C++ and using Alea C#.

I first noticed this behaviour when upgrading my drivers to 471.11 (also reproduced with 471.41), although it could have been introduced earlier. See the following program for example; if you comment out this line, the measured performance of the program collapses (due to now measuring both the memory copy and the kernel running time):

NVIDIA Nsight System is also confirming that the call to cudaMemcpy is now asynchronous despite implicitly running on the default stream.

I wanted to check first on the forums before officially raising a bug with NVIDIA via

Because if this is true, this is one serious bug.



  • Windows 10 21H1 x64
  • Ryzen 5950X + 128GB RAM
  • GeForce RTX 3090 FE
  • Drivers 471.11
  • Resizable BAR Enabled

I was also able to reproduce this on an Intel 9900K with an RTX 2080 Ti. Same driver version.

The function cudaMemcpy() returning before it has completed copying does not strike me as unexpected behavior as its documentation says

“This function exhibits synchronous behavior for most use cases.”

The description that applies to your particular use case is this one

"For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed. "

I would not expect this behavior to be consistent across all CUDA and driver releases… Maybe the size of the staging memory for DMA transfers has changed?

I cannot believe I’ve never registered this part of cudaMemcpy documentation in years of using CUDA. Thank you very much!

I guess the next question I have now, is why the new drivers seem to take so long to initiate the actual copy. See the following screenshot from Nsight System, where nothing happens for more than 40 milliseconds after initiating the copy (the actual copy takes about 65 microseconds).

Could it be that is the actual problem? That using previous drivers, the copy was so quick it wouldn’t have impacted performance measurements. But somehow the new drivers delay the copying by a noticeable amount of time, for some reason?

is this a consistent delay, or just occuring on the first cudaMemcpy() call?

you might want to try to use a page locked host buffer allocated through cudaMallocHost() just to check if its behavior is different.

The code above is run 10x in a loop. Behaviour is consistent.

I’ve tried replacing the memory allocations with cudaMallocHost. Interestingly it makes things worse. Nsight Compute for some reason then does not show when the actual copy occurs on the GPU nor does it then give the associated performance information. On top of that, the kernel launch has a latency of 40 milliseconds, but putting a cudaDeviceSynchronize after the memory copy does not solve that issue (whereas previously the 40ms latency was on the memory copy).

I.e. performance is now terrible all the time, not matter what, when using pinned memory.

I wonder if it’s something weird is going on with the new drivers that added this 40ms latency to some operations under certain conditions. I checked that the GPU frequency wasn’t going idle or things like that.

This is all on my personal workstation using Windows where the GPU is shared with display rendering. So driver WDDM model might influence what’s going on (as opposed to TCC driver model).

I’d be curious if either setting of this had any effect, however I won’t be able to give you much of an explanation of what it does, other than what is there

Enabling Hardware Accelerated GPU Scheduling had no impact.

Edit In fact it made the performance of the actual CUDA kernels worse by about 25%.