D3D11 device context in a separate thread gets corrupted when CUDA graphics resource mapping is used

I have been chasing this bug for several weeks. More information is available in a GitHub issue report which I filed for a Unity plugin called AVPro Video, available here:

I have written a Media Foundation Media Source object (a DLL registered with the Media Foundation framework) which parses omnidirectional media files, decodes and composites them with CUDA, and delivers Direct3D 11 media samples stored on the GPU to the Media Foundation player chain.

At a high level, the system works like this: my code creates a pool of very large (10k or 12k pixels wide) D3D11 textures with RGBA pixels. Threads decode multiple HEVC streams using NVDEC/CUVID, where the output of each HEVC stream corresponds to a fixed smaller rectangle within the large output frame. When a new frame is started, a large output frame is taken from the D3D11 texture pool and mapped to a CUDA array using cuGraphicsMapResources(). As each sub-frame within the large output frame is decoded by the HEVC decoder, a CUDA function converts the output from NV12 to RGBA and composites the sub-frame into the big output frame. When all of the pieces of the big frame are complete, it is unmapped from CUDA, encapsulated in a Media Foundation media sample object, and sent to the player via Media Foundation. When the player has finished using a particular big frame and releases it, my code gets a callback and puts the frame back into the D3D11 texture pool for re-use.
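For reference, the per-frame interop flow described above follows this shape (a simplified sketch using the CUDA driver API, not my exact code; error handling, the registration caching, and the actual NV12-to-RGBA kernel launch are omitted, and `LaunchNv12ToRgbaKernel` is a placeholder name):

```cpp
#include <cuda.h>
#include <cudaD3D11.h>
#include <d3d11.h>

// One-time setup per pooled texture: register the D3D11 texture with CUDA.
CUgraphicsResource RegisterOutputTexture(ID3D11Texture2D* tex)
{
    CUgraphicsResource res = nullptr;
    cuGraphicsD3D11RegisterResource(&res, tex, CU_GRAPHICS_REGISTER_FLAGS_NONE);
    return res;
}

// Per-frame: map the texture, composite each decoded sub-frame into it,
// then unmap so the texture can be handed to Media Foundation.
void CompositeFrame(CUgraphicsResource res, CUstream stream)
{
    cuGraphicsMapResources(1, &res, stream);

    CUarray outputArray = nullptr;
    cuGraphicsSubResourceGetMappedArray(&outputArray, res,
                                        /*arrayIndex*/ 0, /*mipLevel*/ 0);

    // For each decoded HEVC sub-frame: convert NV12 to RGBA and write it
    // into that sub-frame's fixed rectangle of outputArray.
    // LaunchNv12ToRgbaKernel(outputArray, subRect, nv12Frame, stream);  // placeholder

    cuGraphicsUnmapResources(1, &res, stream);
    // After unmap, the texture is valid for D3D11 use downstream.
}
```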

So all of this works perfectly when my media source object runs in a simple Media Foundation test player application. But the final environment this was designed for includes Unity, the AVPro Video plugin, and SteamVR. In that environment there are many more threads using D3D and the GPU, and when my decoder is used with this software, the system will occasionally (every 100-10,000 frames rendered) encounter a D3D error ending in a GPU crash / device-removed scenario.

I used the D3D debug layer and observed that the first failures to occur are D3D validation errors, producing messages like these:

D3D11 ERROR: ID3D11DeviceContext::Draw: Current Primitive Topology value (0) is not valid. [ EXECUTION ERROR #365: DEVICE_DRAW_INVALID_PRIMITIVETOPOLOGY]
D3D11 ERROR: ID3D11DeviceContext::Draw: A Vertex Shader is always required when drawing, but none is currently bound. [ EXECUTION ERROR #341: DEVICE_DRAW_VERTEX_SHADER_NOT_SET]
D3D11 ERROR: ID3D11DeviceContext::Draw: Rasterization Unit is enabled (PixelShader is not NULL or Depth/Stencil test is enabled and RasterizedStream is not D3D11_SO_NO_RASTERIZED_STREAM) but position is not provided by the last shader before the Rasterization Unit. [ EXECUTION ERROR #362: DEVICE_DRAW_POSITION_NOT_PRESENT]
...
D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]

From these error messages, it appears that some race condition is causing the ID3D11DeviceContext to become corrupted. I don’t even create or use an ID3D11DeviceContext anywhere in my code, so this is a baffling problem. To debug it, I disabled different parts of my code to isolate the issue, and what I found was that the D3D device context failure does not happen if I don’t call cuGraphicsMapResources() / cuGraphicsUnmapResources(). Obviously that is not a viable work-around, since I need these functions to do what needs to be done, but for the sake of debugging these tests definitively demonstrated that these calls are what trigger the D3D failure.

I even ran an experiment to rule out some other bug in the downstream Media Foundation player, such as my D3D11 texture being used after release. I modified my D3D11 texture buffer pool to create two output textures (call them A and B) for each frame instead of one, did the CUDA resource mapping and unmapping on the A textures, and sent the B textures down the Media Foundation pipeline. Even in this case, the D3D device context errors sometimes happen if I call the CUDA graphics mapping functions, but never if I don’t call them.
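Abstractly, the A/B pooling from that experiment works like this (a purely illustrative sketch: plain int handles stand in for the real ID3D11Texture2D pointers, and the names are mine, not from my actual code):

```cpp
#include <deque>
#include <mutex>

// Each pool entry is a pair of textures created together: texA is the one
// CUDA maps/unmaps and composites into; texB is the one actually delivered
// down the Media Foundation pipeline. Plain ints stand in for textures.
struct FramePair { int texA; int texB; };

class FramePairPool {
public:
    explicit FramePairPool(int count) {
        for (int i = 0; i < count; ++i)
            free_.push_back({ /*texA*/ 2 * i, /*texB*/ 2 * i + 1 });
    }
    // Called when a new output frame is started.
    FramePair Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        FramePair p = free_.front();
        free_.pop_front();
        return p;
    }
    // Called from the Media Foundation sample-release callback.
    void Release(FramePair p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(p);
    }
private:
    std::mutex mutex_;
    std::deque<FramePair> free_;
};
```

Since only the B textures ever reach the player, any use-after-release bug downstream could not touch the A textures that CUDA maps, which is what isolates the mapping calls as the trigger.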

So from this evidence, it really looks like some side-effect of the cuGraphicsMapResources() / cuGraphicsUnmapResources() functions is causing the D3D11 device context to occasionally become corrupted in another thread. Is it possible that cuGraphicsMapResources() internally retrieves the immediate ID3D11DeviceContext corresponding to the ID3D11Device and uses it during the map/unmap operation? If so, that seems like a bug in CUDA, because the immediate D3D11 device context is not thread-safe and needs to be protected with some kind of mutex.
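To make that last point concrete: if CUDA really does touch the immediate context internally, the only application-side defense I can think of would be to route every immediate-context use and every map/unmap call through one shared mutex. A toy illustration of the idea (the two functions below are stand-ins I wrote for "render thread touches the context" and "CUDA maps/unmaps a frame"; they are not real API calls, and the two-field struct just models state that must be updated atomically):

```cpp
#include <mutex>
#include <thread>

// Shared guard: every path that touches the (non-thread-safe) immediate
// context takes this lock, including the CUDA map/unmap path.
std::mutex g_contextMutex;

// Toy "device context": two fields that must always be updated together.
// If two threads interleave inside an update, the invariant a == b breaks,
// which models the kind of state corruption the D3D errors above suggest.
struct ToyContext { long a = 0; long b = 0; };
ToyContext g_ctx;

void TouchContext()             // stands in for render-thread Draw calls
{
    std::lock_guard<std::mutex> lock(g_contextMutex);
    ++g_ctx.a;
    ++g_ctx.b;
}

void MapUnmapOutputFrame()      // stands in for cuGraphicsMap/UnmapResources
{
    std::lock_guard<std::mutex> lock(g_contextMutex);
    ++g_ctx.a;
    ++g_ctx.b;
}

bool RunThreads()
{
    std::thread t1([] { for (int i = 0; i < 100000; ++i) TouchContext(); });
    std::thread t2([] { for (int i = 0; i < 100000; ++i) MapUnmapOutputFrame(); });
    t1.join();
    t2.join();
    return g_ctx.a == g_ctx.b;  // holds only because both paths share the lock
}
```

Of course, I cannot apply this in practice, because the rendering threads belong to Unity/SteamVR and I have no way to make them take a lock that CUDA's internals would also honor, which is why this looks like something only the CUDA driver can fix.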