DX11: DXGI_ERROR_DEVICE_HUNG since 2080 TI (guess at certain tessellation stage)

rene.jan · September 9, 2020, 4:33pm

Hi NV Team,

We have problems with our DX11 engine running on NVIDIA’s 2080 Ti series, which we have already integrated into certain customer projects in large numbers. On the 1080 Ti and lower, our engine runs fine.
We have already isolated the problem to a certain tessellation stage and nailed it down in an Nsight snapshot (~50 MB). There, the frame is rendered for approx. 10 minutes until we get a “DXGI_ERROR_DEVICE_HUNG”.
From our point of view, we are not misusing the D3D11/HLSL API. We have tried many, many code variations to track the crash down further, without success.

Is there a way for you to take a look at the snapshot? Can I upload the data somewhere on your side, or send it by mail? Uploading it here is not possible due to the size of the compressed archive.

Cheers,
René

rene.jan · September 9, 2020, 4:41pm

Further info:
We could reproduce the problem on a development machine by making the same graphics card swap.
Using Aftermath we could narrow it down to a single render call. We even have two different render paths in that case (DrawInstancedIndirect vs. DrawInstanced), and both crash.
Even after stripping down the scene to pretty much nothing except this render call and stripping down the used shader considerably, we still get the same crash.
What seems to prevent the crash is setting all the EdgeTessFactors in the hull shader to 1.0f, which disables tessellation entirely.

We then took a C++ Capture with Nsight which can reproduce the crash. It might take a while until the crash happens, usually up to 15 minutes (but we had everything from 10 seconds to 2 hours).

We ran the debug version of the capture from within VS2019 with D3D debug layer active and the following arguments:
Application__2020_05_08__12_23_19.exe -automated -wb

The error output:
D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]
Nvda.Replay Error: Present Failed with error 0x887a0005
Exception thrown at 0x00007FF846C3A799 in Application__2020_05_08__12_23_19.exe: Microsoft C++ exception: std::runtime_error at memory location 0x000000AED50FDD38.

D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]
D3D11: **BREAK** enabled for the previous message, which was: [ ERROR EXECUTION #378: DEVICE_REMOVAL_PROCESS_AT_FAULT ]
Exception thrown at 0x00007FFA304FA799 (KernelBase.dll) in DisiApplication__2020_05_15__17_40_19.exe: 0x0000087A (parameters: 0x0000000000000001, 0x000000F4589895F0, 0x000000F45898B3E0).

We also tried different driver versions, from the first version supporting the 2080 Ti to the latest version and one in between, but all of them crash in the same way.
We also managed to find another system with a 2080 Super and that one ran a whole weekend without crashing.
When we deactivate TDR, the system freezes and a reboot is required.

Thanks in advance for any ideas on how to tackle the problem.

Best,
René
