cudaDeviceSynchronize sometimes hangs with driver versions after 411.63

Christian.Dicker · October 19, 2018, 3:55pm

I updated to 416.34 recently and noticed that a cudaDeviceSynchronize call sometimes hangs. Sometimes it recovers; sometimes it appears to hang indefinitely. Using several profiling tools (Nsight, GPU-Z, Task Manager) I can confirm that the GPU is not being utilized at all during this time. I’ve reproduced this on a 1080Ti, a 1070Ti, and a Quadro with WDDM drivers. This only seems to happen on Windows: I cannot reproduce the problem on Linux with a 2080 or a 1070, and the Quadro card on Windows with TCC drivers also works fine. I tried reverting to driver version 411.63 and still saw the same problem. Has anyone else noticed this behavior? This seems like a bug in the driver.
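To tell a genuine hang apart from a slow kernel, one option is to poll an event with a timeout instead of blocking in cudaDeviceSynchronize. The following is only a sketch under assumptions: `dummyKernel`, `waitWithTimeout`, and the 5-second limit are hypothetical and not taken from the application described above.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real workload is not shown in the post.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Poll an event instead of blocking on it.
// Returns true if the event completed before the timeout expired.
bool waitWithTimeout(cudaEvent_t ev, std::chrono::milliseconds timeout) {
    const auto start = std::chrono::steady_clock::now();
    for (;;) {
        cudaError_t st = cudaEventQuery(ev);
        if (st == cudaSuccess) return true;        // all prior work finished
        if (st != cudaErrorNotReady) {             // a real error, not "still running"
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(st));
            return false;
        }
        if (std::chrono::steady_clock::now() - start > timeout) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(done, 0);                      // marks the point to wait for

    // Assumed limit: 5 seconds is arbitrary and should exceed the expected runtime.
    if (!waitWithTimeout(done, std::chrono::seconds(5))) {
        puts("timed out or failed");
        return 1;                                  // skip cleanup: cudaFree could block too
    }
    puts("completed");
    cudaEventDestroy(done);
    cudaFree(d);
    return 0;
}
```

If the timeout fires while the profiling tools show an idle GPU, that points at the wait itself rather than at the kernel, which is useful detail for a bug report.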

I was told that cudaMalloc and cudaFree cause a device synchronization to take place, so I tried replacing my cudaDeviceSynchronize with a cudaMalloc followed immediately by a cudaFree. Surprisingly, that fixed the problem. So, what are cudaMalloc and cudaFree doing (or not doing) under the hood?
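For reference, the substitution described above could look roughly like this. It is a sketch, not the original code: the helper name `syncViaMallocFree`, the 1-byte allocation size, and `dummyKernel` are assumptions made for illustration. It also relies on the implicit synchronization attributed to cudaMalloc/cudaFree, which is not a documented replacement for cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real workload is not shown in the post.
__global__ void dummyKernel(int *out) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *out = 42;
}

// Workaround from the post: a throwaway allocation that is freed immediately,
// used in place of cudaDeviceSynchronize().
// Assumption: 1 byte is enough; the post does not state the size used.
cudaError_t syncViaMallocFree() {
    void *scratch = nullptr;
    cudaError_t err = cudaMalloc(&scratch, 1);
    if (err != cudaSuccess) return err;
    return cudaFree(scratch);
}

int main() {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return 1;

    dummyKernel<<<1, 32>>>(d_out);

    // Previously: cudaDeviceSynchronize();
    cudaError_t err = syncViaMallocFree();
    if (err != cudaSuccess) {
        fprintf(stderr, "sync workaround failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The blocking copy below also waits for the kernel on the default stream,
    // so the printed value does not by itself prove the workaround synchronized.
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", h_out);

    cudaFree(d_out);
    return 0;
}
```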

Christian.Dicker · October 22, 2018, 1:44pm

In my original post, I said that the problem was reproducible in Linux until I upgraded to CUDA 10. That was a mistake. The problem was never reproducible in Linux.

njuffa · October 22, 2018, 9:54pm

I am using Quadro P2000 / Win 7 Pro 64-bit / CUDA 8 / WDDM 411.63 and I have not experienced this issue so far. From your description it sounds like a possible livelock scenario?

I have no idea what kind of synchronization cudaMalloc/cudaFree perform under the hood, but it is not difficult to imagine that the amount of synchronization they require may be less than the full hammer provided by cudaDeviceSynchronize.
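As an illustration of waiting on less than the whole device, the sketch below synchronizes on a single stream and on a single event instead of calling cudaDeviceSynchronize. The kernel, stream names, and sizes are hypothetical, and whether a narrower wait avoids the hang reported here is untested.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,           \
                    cudaGetErrorString(e_));                             \
            exit(1);                                                     \
        }                                                                \
    } while (0)

// Hypothetical kernel used only to put work on each stream.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const int grid = (n + 255) / 256;
    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemset(a, 0, n * sizeof(float)));
    CHECK(cudaMemset(b, 0, n * sizeof(float)));

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    scaleKernel<<<grid, 256, 0, s1>>>(a, n, 2.0f);
    scaleKernel<<<grid, 256, 0, s2>>>(b, n, 3.0f);
    CHECK(cudaGetLastError());

    // Narrower than cudaDeviceSynchronize(): waits only for work queued on s1.
    CHECK(cudaStreamSynchronize(s1));

    // Narrower still: waits only for work queued on s2 before the event.
    cudaEvent_t ev;
    CHECK(cudaEventCreateWithFlags(&ev, cudaEventDisableTiming));
    CHECK(cudaEventRecord(ev, s2));
    CHECK(cudaEventSynchronize(ev));

    puts("both streams reached their sync points");

    CHECK(cudaEventDestroy(ev));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```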

Christian.Dicker · October 23, 2018, 5:57pm

Thanks for the response, njuffa. I’m going to continue trying to tease out the problem. Until your post, I had only tried to reproduce the problem on Windows 10, but I can now confirm that it is reproducible on Windows 7 as well.

njuffa · October 23, 2018, 6:44pm

Since you seem to have a reliable reproducer across a variety of platforms, it may be time to consider filing a bug with NVIDIA.