D3D11: Problems writing GPU crash dumps with Aftermath SDK 2019.1.2

Hi NV-Team,

We are currently enhancing the debugging functionality of our in-house D3D11 render engine. That includes integrating Aftermath.

To that end, I’ve tried to adapt the (DX12) sample you’ve published on GitHub to our DX11 environment.

To test it, I reproduced an existing, unsolved “DXGI_ERROR_DEVICE_REMOVED” error in our engine, but in ~50 trials Aftermath triggered the callback to write a crash dump only once.

Our application spawns ~20 threads. Could it be that the Aftermath thread doesn’t get enough priority before the application exits? (The sample does a Sleep(3000) before exit(-1) is called.)

Is it possible to force Aftermath to write the crash dump (when D3D raises the DXGI_ERROR_DEVICE_REMOVED event)?
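To illustrate the exit path I have in mind: instead of a fixed Sleep(3000), the thread that detects the device removal could wait until the crash dump callback signals that it has run, with a timeout as a fallback. This is only a minimal sketch. CrashDumpSignal and its methods are hypothetical names of mine, not part of the Aftermath SDK, and it assumes the callback is invoked on a separate, driver-owned thread.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not an Aftermath API): lets the exit path wait for
// the crash dump callback instead of sleeping for a fixed time.
class CrashDumpSignal {
public:
    // Call at the end of the crash dump callback (assumed to run on a
    // different thread than the one that waits).
    void NotifyDumpWritten() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            written_ = true;
        }
        cv_.notify_all();
    }

    // Call on the thread that saw DXGI_ERROR_DEVICE_REMOVED.
    // Returns true if the callback fired before the timeout expired.
    bool WaitForDump(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return written_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool written_ = false;
};
```

The waiting thread blocks while it waits, so it would not compete with the callback thread for CPU time, and a false return value at least tells us that the callback never fired.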

Any thoughts about it? Thx in advance.

René

Hello,

Please note that unfortunately D3D11 isn’t an officially supported API for Aftermath.

According to our engineer, it is hard to tell whether your analysis is correct or not. It does sound somewhat plausible, though. If you have many high priority threads, then the driver thread that listens for GPU events might starve.

Let us know if you still have questions.

Hi there,

Are there any reliable DX11 features in the current version of Aftermath besides the crash dump feature (GFSDK_Aftermath_GetData, GFSDK_Aftermath_GetDeviceStatus) that we could use to track down the problem?

I have narrowed down the problem in our software. I now know that it has something to do with our data streaming and its destruction and construction of GPU resources at runtime.

Before “Error: Device Lost: Reason code DXGI_ERROR_DEVICE_REMOVED [0x887A0005]” the MS D3D debug layer is “silent”, and so is MS AppVerifier (no detectable CPU memory leaks). I am far from making any progress on this bug. :-(

Now my questions:

(1) Do you guys know of, or use, alternative tools to hunt down these kinds of errors?
(2) Is there any way to get the NVIDIA symbols (PDBs), or can I send you the mem.dmp taken right after DX raises the event (D3DDevice::RegisterDeviceRemovedEvent)?
(3) Any other ideas for making progress with the bug hunt?

Thx in advance.

René

I forgot the following notes:

  • the first call of GFSDK_Aftermath_GetDeviceStatus() after the crash returns GFSDK_Aftermath_Device_Status_Unknown
  • the follow-up calls of GFSDK_Aftermath_GetDeviceStatus() after the crash return GFSDK_Aftermath_Device_Status_DmaFault
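Because the first query comes back as Unknown and only later ones report the fault, it seems sensible to poll a few times rather than trust the first answer. A rough sketch of that idea follows. DeviceStatus and PollDeviceStatus are stand-ins I made up for illustration; in the real application the query callable would wrap GFSDK_Aftermath_GetDeviceStatus(), and the stand-in enum models only the two values I observed.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Stand-in for the SDK status enum; only the two values seen in our tests.
enum class DeviceStatus { Unknown, DmaFault };

// Hypothetical wrapper type: in real code this would call
// GFSDK_Aftermath_GetDeviceStatus() and translate the result.
using StatusQuery = std::function<DeviceStatus()>;

// Queries the status up to maxAttempts times, pausing between attempts,
// and returns the first answer that is not Unknown (or Unknown if none).
DeviceStatus PollDeviceStatus(const StatusQuery& query,
                              int maxAttempts,
                              std::chrono::milliseconds delay) {
    DeviceStatus status = DeviceStatus::Unknown;
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        status = query();
        if (status != DeviceStatus::Unknown) {
            break;  // the driver has reported a concrete fault
        }
        std::this_thread::sleep_for(delay);
    }
    return status;
}
```

The attempt count and delay are arbitrary here; I do not know how long the driver needs before it reports the final status.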

Hello again,

We just found what causes the crash. It turns out that we had a call to CopySubresourceRegion1() that was accidentally made with the same source and destination location within the same (tiled resource, fully mapped) buffer. After excluding that case, the device was never removed again.

In a situation like this, a direct notification from the D3D debug layer would be appropriate, but there wasn’t one. Instead, we set a breakpoint where D3D raises the DEVICE_REMOVAL_PROCESS_NOT_AT_FAULT (#380) error and dug into the disassembly with the MS debug symbols (function names). There we learned that the problem relates to either UpdateSubResourceRegion(1) or CopySubresourceRegion(1). Jackpot.

Maybe this info helps others who fall into the same trap.
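For anyone who wants to catch this on the CPU side before the copy is submitted, a check along the following lines should be enough. It is a sketch, not our actual engine code: BufferCopy and IsOverlappingSelfCopy are hypothetical names, the offsets are assumed to be byte offsets into a buffer, and the resource pointers are only compared for identity. It also flags partially overlapping ranges, which is stricter than the exact-same-location case we hit.

```cpp
#include <cstdint>

// Hypothetical description of one buffer-to-buffer copy, in bytes.
struct BufferCopy {
    const void* srcResource;  // identity of the source buffer
    const void* dstResource;  // identity of the destination buffer
    std::uint64_t srcOffset;  // first source byte
    std::uint64_t dstOffset;  // first destination byte
    std::uint64_t size;       // number of bytes to copy
};

// Returns true if source and destination are the same buffer and their
// byte ranges overlap (identical locations are the extreme case).
// Assumes offset + size does not overflow 64 bits.
bool IsOverlappingSelfCopy(const BufferCopy& c) {
    if (c.srcResource != c.dstResource || c.size == 0) {
        return false;
    }
    // Two half-open ranges [a, a+n) and [b, b+n) overlap iff each one
    // starts before the other one ends.
    return c.srcOffset < c.dstOffset + c.size &&
           c.dstOffset < c.srcOffset + c.size;
}
```

Calling something like this right before CopySubresourceRegion1() and asserting or skipping the copy when it returns true would have pointed us at the faulty call directly.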

Best,
René