DirectX Synchronized Shared Surfaces & CUDA graphs


I am working on real-time video processing software using CUDA and DirectX 11. Both technologies access a shared 2D texture from distinct threads. The shared texture is protected and synchronized by means of DirectX Synchronized Shared Surfaces [].
During development I ran into a deadlock that only occurs in conjunction with CUDA graphs. An outline of the steps to reproduce it is as follows:

Thread 1                                       Thread 2

cudaGraphLaunch()                              IDXGIKeyedMutex::AcquireSync()
    Eventually executes a host node:           cudaGraphicsMapResources()
    IDXGIKeyedMutex::AcquireSync() *           cudaGraphicsUnmapResources() *
    IDXGIKeyedMutex::ReleaseSync()             IDXGIKeyedMutex::ReleaseSync()

The deadlock happens in AcquireSync() of Thread 1 and cudaGraphicsUnmapResources() of Thread 2 (marked with * above). Their debug call stacks indicate contention in win32u.dll!NtGdiDdDDIAcquireKeyedMutex2 and nvcuda64.dll!00007ffd83ea99d0.
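For context, the host node in Thread 1 is set up roughly as follows. This is only a sketch of the pattern, not the attached reproducer; names such as g_keyedMutex, the key value 0, and the omitted error checking are illustrative:

```cuda
#include <cuda_runtime.h>
#include <dxgi1_2.h>

IDXGIKeyedMutex* g_keyedMutex;  // obtained from the shared texture (illustrative)

// Host function executed by the graph; this is where the deadlock occurs.
void CUDART_CB lockAndProcess(void* userData)
{
    g_keyedMutex->AcquireSync(0, INFINITE);  // * blocks forever in the deadlock
    // ... touch the shared surface on the CPU ...
    g_keyedMutex->ReleaseSync(0);
}

void buildGraph(cudaGraph_t& graph, cudaGraphExec_t& graphExec)
{
    cudaGraphCreate(&graph, 0);

    cudaHostNodeParams params{};
    params.fn = lockAndProcess;
    params.userData = nullptr;

    cudaGraphNode_t hostNode;
    cudaGraphAddHostNode(&hostNode, graph, nullptr, 0, &params);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    // Thread 1 then repeatedly calls cudaGraphLaunch(graphExec, stream).
}
```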

However, if the use of CUDA graphs is omitted, there is no deadlock:

Thread 1                                       Thread 2

IDXGIKeyedMutex::AcquireSync()                 IDXGIKeyedMutex::AcquireSync()
IDXGIKeyedMutex::ReleaseSync()                 cudaGraphicsMapResources()

Both threads operate on their own (non-default, non-blocking) CUDA stream. I am using the latest versions:
Microsoft Windows 10 Pro, Version 10.0.18362 Build 18362
d3d11.dll 10.0.18362.387
CUDA 11.0.182
Quadro GP100 with driver version 451.22

I attached a minimal reproducing example.
MinimalReproducer.cpp (5.9 KB)

Any helpful comments are much appreciated.
Thank you.


I was able to generalize my minimal example further: replacing the IDXGIKeyedMutex with a std::mutex makes no difference; it still runs into the same deadlock.
The problem is therefore not specifically related to Synchronized Shared Surfaces, or to DirectX at all.
There seems to be a resource/lock that is held during the execution of the callback, and I ask you to release it before the callback and re-acquire it afterwards, if that is valid to do. Holding a lock while running code in an unpredictable context (callbacks) is generally not a good idea.

Please find my modified example using only std::mutex: mainStdMutex.cpp (5.5 KB)

By the way, do you consider making CUDA graphs open source?