How to debug a DX11 GPU crash?

Hello, Internet hive mind.

I have a problem with a GPU crash that I've so far been unable to figure out. Maybe some of you lovely people have suggestions?

Our application is split over a client/server architecture. (Same binary, different command line arguments.) In this particular configuration, the server decodes eight 2160p video streams, processes them, and encodes a single video stream (resolution to match the client window size) and transmits that to the client which (among other things) displays it.

We're using the NVIDIA encoder and decoder SDKs (NVENC/NVDEC) for handling the video, and DX11 on Windows and OpenGL on Linux for other tasks.

On Linux with OpenGL everything works as expected.

On Windows (10, latest drivers, RTX 2080 GPU) with DX11 everything works as expected - as long as the server window isn't maximised when the client starts.

I can happily run the server on its own forever. (It just shows a similar, but independent, view to the client's.) When a client is connected, the server renders two views and encodes one of them for the client.

I can happily run the server and client on Windows with unmaximised windows and everything is fine. However, if the server window is maximised when a client connects (and this client can be running locally on the same machine or remotely on Linux) I encounter the following GPU crash on the server:

D3D11: Removing Device.
D3D11 ERROR: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware). [ EXECUTION ERROR #378: DEVICE_REMOVAL_PROCESS_AT_FAULT]
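For context, this is roughly how the removal gets detected - a minimal sketch, not our exact code; `device` and `swapChain` stand in for our usual `ID3D11Device`/`IDXGISwapChain` pair:

```cpp
// Sketch: after a failed Present(), ask the device why it was removed.
// GetDeviceRemovedReason() returns the same DXGI_ERROR_* codes the
// debug layer reports.
HRESULT hr = swapChain->Present(1, 0);
if (hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)
{
    HRESULT reason = device->GetDeviceRemovedReason();
    // DXGI_ERROR_DEVICE_HUNG           -> TDR fired (our case)
    // DXGI_ERROR_DEVICE_RESET          -> the driver reset the GPU
    // DXGI_ERROR_DRIVER_INTERNAL_ERROR -> driver bug or out of memory
    // DXGI_ERROR_INVALID_CALL          -> app error; debug layer has details
}
```

In our case the reason is consistently `DXGI_ERROR_DEVICE_HUNG`, matching the debug output above.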

If the server window isn't maximised when a client starts, everything works. (I suspect the trigger is actually a cut-off point related to the size of the window, not the maximised state itself.) I can even maximise all the windows after start-up and everything remains fine.

When running the server under NSight Graphics, I get the following rather cryptic (and possibly unrelated) error:

A query for ‘ID3D11VideoDevice (10EC4D5B-975A-4689-B9E4-D0AAC30FE333)’ on an instance of ‘ID3D11Device5’ succeeded in the runtime but was force-failed due to an incompatibility with Nsight. Were this query to proceed, it likely would have caused unintended problems, including a possible crash. If a crash is encountered subsequently to this message, please investigate the unsupported query as a likely source of error.

We have no references to ID3D11VideoDevice in our code, so that's very much NVDEC/NVENC territory.
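One thing I'm planning to try is creating the device with the debug layer enabled and making the info queue break into the debugger on the first error, so execution stops at the offending call rather than at the eventual Present. A sketch of that (error handling omitted):

```cpp
// Sketch: create the D3D11 device with the debug layer and break on the
// first corruption/error message via ID3D11InfoQueue.
UINT flags = D3D11_CREATE_DEVICE_DEBUG;
ID3D11Device* device = nullptr;
ID3D11DeviceContext* context = nullptr;
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                  nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);

ID3D11InfoQueue* infoQueue = nullptr;
if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11InfoQueue),
                                     reinterpret_cast<void**>(&infoQueue))))
{
    infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_CORRUPTION, TRUE);
    infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_ERROR, TRUE);
    infoQueue->Release();
}
```

Though since the hang appears to originate inside the encoder/decoder rather than our own D3D11 calls, I'm not sure how much this will catch.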

So, how can I debug this? I can’t use Nsight Aftermath because that doesn’t support DX11. Any suggestions?
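One diagnostic I can think of is raising the Windows TDR timeout via the registry, to distinguish genuinely slow (but finite) GPU work - e.g. a much larger encode at maximised resolution - from a hard hang. Something like the following (the default `TdrDelay` is 2 seconds; a reboot is required):

```
Windows Registry Editor Version 5.00

; Diagnostic aid only: raise the TDR timeout from 2 to 10 seconds.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```

If the crash merely takes longer to appear with a larger delay, that would point at slow or blocked GPU work rather than an outright driver hang. But that still doesn't tell me *what* is hanging, hence this post.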