I am working on real-time video processing software that uses CUDA and DirectX 11. Both technologies access a shared 2D texture from separate threads. Access to the shared texture is protected and synchronized with DirectX Synchronized Shared Surfaces [https://docs.microsoft.com/en-us/windows/win32/direct3darticles/surface-sharing-between-windows-graphics-apis#dxgi-11-synchronized-shared-surfaces].
During development I ran into a deadlock that only occurs in conjunction with CUDA graphs. The sequence that reproduces it is outlined below:
| Thread 1 | Thread 2 |
| --- | --- |
| cudaGraphLaunch() | IDXGIKeyedMutex::AcquireSync() |
| Eventually executes a host node: | cudaGraphicsMapResources() |
| IDXGIKeyedMutex::AcquireSync() * | cudaGraphicsUnmapResources() * |
| IDXGIKeyedMutex::ReleaseSync() | IDXGIKeyedMutex::ReleaseSync() |
The deadlock involves the two calls marked with *: Thread 1 blocks in AcquireSync() and Thread 2 blocks in cudaGraphicsUnmapResources(). Their debug call stacks show the threads waiting in win32u.dll!NtGdiDdDDIAcquireKeyedMutex2 and nvcuda64.dll!00007ffd83ea99d0, respectively.
However, when CUDA graphs are not used, no deadlock occurs:
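To make the suspected wait cycle concrete, here is a minimal portable C++ sketch that models it with two timed mutexes instead of the real APIs. Everything in it is hypothetical: `surface` stands in for the keyed mutex, `driver` stands in for some driver-internal resource that I am only assuming is held while a host node runs and is also needed by cudaGraphicsUnmapResources(), and `simulateWaitCycle` and `WaitCycleResult` are names made up for this illustration. It does not call CUDA or DXGI, and it is not a confirmed explanation of what the driver does.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

struct WaitCycleResult {
    bool hostNodeGotSurface;  // Thread 1: stand-in for AcquireSync() succeeded
    bool unmapGotDriver;      // Thread 2: stand-in for cudaGraphicsUnmapResources() succeeded
};

// Models the outline above with finite timeouts so that the cycle shows up
// as a failed wait instead of a hang.
WaitCycleResult simulateWaitCycle(std::chrono::milliseconds acquireTimeout,
                                  std::chrono::milliseconds unmapTimeout) {
    std::timed_mutex surface;  // stand-in for the IDXGIKeyedMutex
    std::timed_mutex driver;   // ASSUMPTION: hypothetical driver-internal resource
    std::promise<void> surfaceHeld;
    std::promise<void> hostNodeStarted;
    std::future<void> surfaceHeldF = surfaceHeld.get_future();
    std::future<void> hostNodeStartedF = hostNodeStarted.get_future();
    WaitCycleResult result{false, false};

    std::thread thread2([&] {
        // AcquireSync(): Thread 2 owns the surface until this scope ends.
        std::lock_guard<std::timed_mutex> ownSurface(surface);
        surfaceHeld.set_value();
        hostNodeStartedF.wait();
        // cudaGraphicsUnmapResources(), modelled as needing the driver resource.
        if (driver.try_lock_for(unmapTimeout)) {
            result.unmapGotDriver = true;
            driver.unlock();
        }
    });  // leaving the scope models ReleaseSync()

    std::thread thread1([&] {
        surfaceHeldF.wait();
        // The host node is executing: modelled as holding the driver resource.
        std::lock_guard<std::timed_mutex> hostNodeRunning(driver);
        hostNodeStarted.set_value();
        // AcquireSync() inside the host node, with a finite timeout.
        if (surface.try_lock_for(acquireTimeout)) {
            result.hostNodeGotSurface = true;
            surface.unlock();
        }
    });

    thread1.join();
    thread2.join();
    return result;
}
```

In this model, whichever side has the shorter timeout gives up first and thereby lets the other side continue; with infinite timeouts on both sides neither thread could ever proceed. In the real program, passing a finite dwMilliseconds value to AcquireSync() instead of INFINITE should similarly turn the hang into a timeout result that can be logged.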
| Thread 1 | Thread 2 |
| --- | --- |
| IDXGIKeyedMutex::AcquireSync() | IDXGIKeyedMutex::AcquireSync() |
| IDXGIKeyedMutex::ReleaseSync() | cudaGraphicsMapResources() |
| | cudaGraphicsUnmapResources() |
| | IDXGIKeyedMutex::ReleaseSync() |
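For readers unfamiliar with keyed mutexes, the non-graph sequence can be sketched with a small portable C++ model. `KeyedMutexModel` and `runNonGraphSequence` are hypothetical names for this illustration, not part of DXGI or CUDA. The model only mimics the documented keyed-mutex behaviour that an acquire with a given key succeeds once the surface has been released with that same key, with the surface starting out released with key 0. The key hand-off 0 → 1 → 0 is an assumption for the example and is not necessarily what the attached reproducer does.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for IDXGIKeyedMutex (no DXGI involved).
class KeyedMutexModel {
public:
    // Returns false on timeout, like AcquireSync() with a finite timeout.
    bool acquire(std::uint64_t key, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        if (!cv_.wait_for(lock, timeout, [&] { return !held_ && key_ == key; }))
            return false;
        held_ = true;
        return true;
    }
    // Releases the surface and sets the key the next acquirer must use.
    void release(std::uint64_t key) {
        {
            std::lock_guard<std::mutex> lock(m_);
            held_ = false;
            key_ = key;
        }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool held_ = false;
    std::uint64_t key_ = 0;  // the surface starts out released with key 0
};

// Runs the non-graph sequence and returns the order in which steps happened.
std::vector<std::string> runNonGraphSequence() {
    KeyedMutexModel surface;
    std::vector<std::string> log;
    std::mutex logMutex;
    auto record = [&](const std::string& step) {
        std::lock_guard<std::mutex> lock(logMutex);
        log.push_back(step);
    };
    const std::chrono::milliseconds timeout(1000);

    std::thread thread1([&] {
        if (!surface.acquire(0, timeout)) return;
        record("T1 AcquireSync");
        record("T1 ReleaseSync");
        surface.release(1);  // ASSUMPTION: hand the surface to Thread 2 via key 1
    });
    std::thread thread2([&] {
        if (!surface.acquire(1, timeout)) return;
        record("T2 AcquireSync");
        record("T2 cudaGraphicsMapResources");    // placeholder, no CUDA call
        record("T2 cudaGraphicsUnmapResources");  // placeholder, no CUDA call
        record("T2 ReleaseSync");
        surface.release(0);  // hand the surface back with key 0
    });

    thread1.join();
    thread2.join();
    return log;
}
```

Because Thread 2 can only acquire with key 1 after Thread 1 has released with key 1, the two threads serialize cleanly and neither waits on the other while holding the surface.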
Each thread operates on its own (non-default, non-blocking) CUDA stream. I am using the latest versions:
Microsoft Windows 10 Pro, Version 10.0.18362 Build 18362
Quadro GP100 with driver version 451.22
I have attached a minimal reproducing example.
MinimalReproducer.cpp (5.9 KB)
Any helpful comments would be much appreciated.