I ran into an issue where `cuMemcpyDtoHAsync` never returns when certain conditions are met under high GPU load.
I managed to reduce the program to a minimal reproducer and uploaded it:
The repro does roughly the following:
- Initialize a CUDA context.
- `cuMemcpy3DAsync` (D to H)
- `cuMemcpy3DAsync` (H to D)
I found that the issue rarely happens without high GPU load.
Recently my machine has been running Folding@home in the background. If I stop the F@h task, the issue hardly ever seems to happen.
Other random notes:
- Repro rate is around 80% (high variation) in my environment with background F@h.
- There are redundant `cuCtxSetCurrent` calls, but if I remove them, I feel the repro rate decreases. (Possibly my imagination.)
- I can't identify which part of the program affects the issue most.
- If I use `cuMemcpyDtoH` instead, the issue still happens.
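
Since the symptom is a call that never returns, a repro rate like the one above could be measured automatically with a small watchdog. The following is only a minimal sketch in plain C++, not code from the reproducer. The helper name `returns_within` and the timeout values are made up for illustration. It runs a call on a detached worker thread and reports whether the call finished before a deadline, so the harness itself does not hang.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper (an assumption, not part of the reproducer):
// runs `call` on a detached worker thread and returns true if it
// finished within `timeout`, false if it was still running.
bool returns_within(std::function<void()> call,
                    std::chrono::milliseconds timeout) {
    // The promise is shared so that it outlives this function if the
    // worker is still stuck inside `call` when the deadline expires.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, call = std::move(call)]() {
        try {
            call();
            done->set_value();
        } catch (...) {
            // An exception still counts as "the call returned".
            done->set_exception(std::current_exception());
        }
    }).detach();

    // A future obtained from a promise does not block in its destructor,
    // so a call that hangs does not also hang the caller.
    return finished.wait_for(timeout) == std::future_status::ready;
}
```

In a test harness, the lambda passed to `returns_within` would wrap the copy that hangs, and the number of `false` results over many runs gives the repro rate. Note that the current CUDA context is per-thread in the driver API, so the worker thread would need its own `cuCtxSetCurrent` call before issuing any copy. A worker that is truly stuck cannot be cancelled, so the harness should exit the process after reporting a hang.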
Is this a CUDA issue, or am I doing something illegal?
Windows 10 20H2
Core i9-9900K, 32GB DDR4
GeForce RTX 3080
NVIDIA Driver 461.09