Hi NV Team,
We have problems with our DX11 engine running on NVIDIA’s 2080 Ti series, which we have already integrated into a large number of customer projects. On the 1080 Ti and lower, our engine runs fine.
We have already isolated the problem to a certain tessellation stage and nailed it down in an Nsight snapshot (~50 MB). In that snapshot, the frame renders for approx. 10 minutes until we get a “DXGI_ERROR_DEVICE_HUNG” error.
From our point of view, we are not doing anything wrong with the D3D11/HLSL API. We have tried many, many variations of the code to track the crash down further, without success.
Is there a way for you to take a look at the snapshot? Can I upload the data somewhere on your side? Or send it by mail? Upload is not possible due to the size of the compressed archive.
Cheers,
René
Further info:
We could reproduce the problem on a development machine by making the same switch of graphics cards.
Using Aftermath, we could narrow it down to a single render call. We even have two different render paths in that case (DrawInstancedIndirect vs. DrawInstanced), and both crash.
Even after stripping down the scene to pretty much nothing except this render call and stripping down the used shader considerably, we still get the same crash.
What seems to prevent the crash is setting all the EdgeTessFactors in the hull shader to 1.0f, which prevents any tessellation.
We then took a C++ capture with Nsight, which reproduces the crash. It can take a while until the crash happens, usually up to 15 minutes (but we have seen everything from 10 seconds to 2 hours).
We ran the debug version of the capture from within VS2019 with the D3D debug layer active and the following arguments:
Application__2020_05_08__12_23_19.exe -automated -wb
The error output:
D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]
Nvda.Replay Error: Present Failed with error 0x887a0005
Exception thrown at 0x00007FF846C3A799 in Application__2020_05_08__12_23_19.exe: Microsoft C++ exception: std::runtime_error at memory location 0x000000AED50FDD38.
D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]
D3D11: **BREAK** enabled for the previous message, which was: [ ERROR EXECUTION #378: DEVICE_REMOVAL_PROCESS_AT_FAULT ]
Exception thrown at 0x00007FFA304FA799 (KernelBase.dll) in DisiApplication__2020_05_15__17_40_19.exe: 0x0000087A (parameters: 0x0000000000000001, 0x000000F4589895F0, 0x000000F45898B3E0).
- We also tried different driver versions, from the first version supporting the 2080 Ti to the latest version and one in between, but all of them crash in the same way.
- We also managed to find another system with a 2080 Super and that one ran a whole weekend without crashing.
- When we deactivate TDR, the system freezes and a reboot is required.
Thanks in advance for any ideas on how to tackle the problem.
Best,
René