Has anyone noticed cudaMemcpy behaving strangely since updating their NVIDIA drivers to 471.11?
In two test programs, cudaMemcpy now appears to be asynchronous: the call returns before the copy from the host to the device has actually completed. I can reproduce this behaviour both with the CUDA 11.4 SDK in C++ and with Alea in C#.
I first noticed this behaviour after upgrading my drivers to 471.11 (also reproduced with 471.41), although it could have been introduced earlier. See the following program for example; if you comment out this line, the measured performance of the program collapses, because the timing then covers both the memory copy and the kernel run:
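For anyone who wants to try this, a minimal sketch of the kind of timing test described above might look like the following. It is not the original program: the `scale` kernel, the buffer size, and the `std::chrono` timing are all assumptions made for illustration. The `cudaDeviceSynchronize()` call right after `cudaMemcpy` is the line to toggle.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: doubles each element.
__global__ void scale(float* data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 26;              // 64M floats (256 MiB), an arbitrary size
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);      // pageable host memory

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Host-to-device copy on the default stream.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Line to toggle: with it, the timer below starts only after the copy
    // has finished on the device. Comment it out to compare timings.
    cudaDeviceSynchronize();

    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);

    auto t0 = std::chrono::steady_clock::now();
    scale<<<blocks, threads>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel to finish
    auto t1 = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("measured kernel time: %.3f ms\n", ms);

    cudaFree(dev);
    return 0;
}
```

Build with something like `nvcc -O2 repro.cu -o repro` (the file name is arbitrary) and run it once with the marked line and once without. If the copy is still in flight when cudaMemcpy returns, the second run should report a noticeably longer time.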
NVIDIA Nsight Systems also confirms that the call to cudaMemcpy is now asynchronous, despite implicitly running on the default stream.
I wanted to check on the forums first before officially raising a bug with NVIDIA via https://developer.nvidia.com/nvidia_bug/add, because if this is true, this is one serious bug.
- Windows 10 21H1 x64
- Ryzen 5950X + 128GB RAM
- GeForce RTX 3090 FE
- Drivers 471.11
- Resizable BAR Enabled