Kernel-Mode Hang (possibly) when trying to reset the device

Hey guys,

So we’re totally abusing and misusing the Vulkan API in some way that the validation layers aren’t catching.
Disclaimer: There is one issue I’m aware of that’s reported by the validation layers (Descriptor in binding #x at GDI #y is being used in draw but has not been updated), but I don’t think it’s relevant to the issue.

As I (very poorly) understand it - we’re doing something that causes the GPU to be lost, and then NVIDIA’s driver steps in to try and recover the device, and hangs the systems of our users using the latest 397.64 drivers.
It’s almost impossible for me to generate a RenderDoc trace that reproduces the problem, because the issue is very inconsistent (and when it does happen, you can’t capture anything).

On top of this, I’m unable to reproduce the issue on AMD cards, whose drivers are much more picky and usually crash our program if we’re doing something bad.

Anyways, I’ve generated a couple of kernel-mode dumps using kd, as well as a few pictures of callstacks (which most likely don’t tell the entire story).
I can’t upload the dumps publicly (because they may contain sensitive information), but can supply them privately.
1)


2)

Hopefully an engineer can get in touch with me and give me specific instructions on how to help.