cuMemcpyDtoHAsync intermittently hangs under high GPU load when certain conditions are met


I ran into an issue where cuMemcpyDtoHAsync never returns when certain conditions are met under high GPU load.
I have managed to reduce the program to a minimal reproducer and uploaded it:

The repro does the following:

  1. Initialize a CUDA context.
  2. cuMemAlloc
  3. cuArray3DCreate
  4. cuMemcpy3DAsync (D to H)
  5. cuMemcpy3DAsync (H to D)
  6. cuTexObjectCreate
  7. cuMemcpyDtoHAsync (hangs here!)

That’s it.

I found that the issue rarely happens without high GPU load.
My machine has recently been running Folding@home in the background. If I stop the F@h task, the issue hardly ever happens.

Other random notes:

  • The repro rate is around 80% (with high variation) in my environment with F@h running in the background.
  • There are redundant cuCtxSetCurrent calls, but if I remove them, the repro rate seems to decrease. (Possibly my imagination.)
  • I can’t identify which part of the program contributes most to the issue.
  • If I use cuMemcpyDtoH instead, the issue still happens.

Is this a CUDA issue, or am I doing something illegal?

My environment:
Windows 10 20H2
Core i9-9900K, 32GB DDR4
CUDA 11.1
GeForce RTX 3080
NVIDIA Driver 461.09