Rare crash deep inside vkQueueSubmit

We’re generally running very stably: we can force an unload of meshes/textures/shaders/materials, and everything starts reloading asynchronously while rendering continues. We don’t get any issues or crashes no matter how hard we stress that path.

But we’ve got a scenario in our game in which we randomly get a crash after 1-20 min of playing.
We’ve not been able to reproduce the issue with AMD, nor with NVIDIA on Linux.
We’ve had the issue on both a 980 Ti and a Titan X (not sure which one…).

It’s always with the same callstack at a random call to vkQueueSubmit() (we’ve got about 4 in a frame).
We’ve tried many things to narrow down what’s responsible for this (not deleting objects, not freeing memory, etc.), but unfortunately the best we achieved was making the crash harder to reproduce.
It also happens with validation layers on and they don’t report anything before the crash happens.

We’d greatly appreciate it if someone with access to the symbols could investigate, and at the very least give us a hint about what it could be related to.
E.g. BOs, textures, descriptor sets, synchronization, barriers, use after delete, memory corruption.

The callstacks we got for the 378.66 drivers are:
nvoglv64.dll!000000005d60dd38() Unknown
nvoglv64.dll!000000005d23f88d() Unknown
nvoglv64.dll!000000005d28a446() Unknown
nvoglv64.dll!000000005d288b42() Unknown
nvoglv64.dll!000000005d277fc0() Unknown
nvoglv64.dll!000000005d379a0e() Unknown
nvoglv64.dll!000000005d37b4d4() Unknown
nvoglv64.dll!000000005d37a26a() Unknown
nvoglv64.dll!000000005d37c30b() Unknown
nvoglv64.dll!000000005d390439() Unknown
nvoglv64.dll!000000005d23bfd7() Unknown
nvoglv64.dll!000000005d354cb2() Unknown
nvoglv64.dll!000000005d31cdc2() Unknown
nvoglv64.dll!000000005d37acf7() Unknown
nvoglv64.dll!000000005d596c7e() Unknown
nvoglv64.dll!000000005d57c9c9() Unknown
Game.exe!S3D::Vulkan::Queue(const XGfx::CommandBuffer & cb, XGfx::DeviceQueue queue) Line 1742

nvoglv64.dll!0000000063e0dd38() Unknown
nvoglv64.dll!0000000063a3f88d() Unknown
nvoglv64.dll!0000000063a8a446() Unknown
nvoglv64.dll!0000000063a88b42() Unknown
nvoglv64.dll!0000000063a77fc0() Unknown
nvoglv64.dll!0000000063b79a0e() Unknown
nvoglv64.dll!0000000063b7b4d4() Unknown
nvoglv64.dll!0000000063b7a26a() Unknown
nvoglv64.dll!0000000063b7c30b() Unknown
nvoglv64.dll!0000000063b90439() Unknown
nvoglv64.dll!0000000063a3bfd7() Unknown
nvoglv64.dll!0000000063b54cb2() Unknown
nvoglv64.dll!0000000063b1cdc2() Unknown
nvoglv64.dll!0000000063b7acf7() Unknown
nvoglv64.dll!0000000063d8f7b6() Unknown
nvoglv64.dll!0000000063d8de6d() Unknown
nvoglv64.dll!0000000063d96bd2() Unknown
nvoglv64.dll!0000000063d7c9c9() Unknown
VkLayer_unique_objects.dll!00007ffc8394e763() Unknown
VkLayer_core_validation.dll!00007ffc829745c5() Unknown
VkLayer_object_tracker.dll!00007ffc81edca50() Unknown
VkLayer_parameter_validation.dll!00007ffc81d5b392() Unknown
VkLayer_threading.dll!00007ffc847be746() Unknown
Game.exe!XGfx::VideoBaseClass::BE_ThreadedRenderFinish() Line 273

nvoglv64.dll!0000000062d8dd38() Unknown
nvoglv64.dll!00000000629bf88d() Unknown
nvoglv64.dll!0000000062a0a446() Unknown
nvoglv64.dll!0000000062a08b42() Unknown
nvoglv64.dll!00000000629f7fc0() Unknown
nvoglv64.dll!0000000062af9a0e() Unknown
nvoglv64.dll!0000000062afb4d4() Unknown
nvoglv64.dll!0000000062afa26a() Unknown
nvoglv64.dll!0000000062afc30b() Unknown
nvoglv64.dll!0000000062b10439() Unknown
nvoglv64.dll!00000000629bbfd7() Unknown
nvoglv64.dll!0000000062ad4cb2() Unknown
nvoglv64.dll!0000000062a9cdc2() Unknown
nvoglv64.dll!0000000062afacf7() Unknown
nvoglv64.dll!0000000062d0f7b6() Unknown
nvoglv64.dll!0000000062d0de6d() Unknown
nvoglv64.dll!0000000062d16bd2() Unknown
nvoglv64.dll!0000000062cfc9c9() Unknown
Game.exe!XGfx::VideoBaseClass::BE_ThreadedRenderFinish() Line 273

For the 377.01 driver we got:
nvoglv64.dll!000000005dc63724() Unknown
nvoglv64.dll!000000005d228d62() Unknown
nvoglv64.dll!000000005d272c95() Unknown
nvoglv64.dll!000000005d271332() Unknown
nvoglv64.dll!000000005d2605aa() Unknown
nvoglv64.dll!000000005d35f7ff() Unknown
nvoglv64.dll!000000005d3612a4() Unknown
nvoglv64.dll!000000005d360056() Unknown
nvoglv64.dll!000000005d36214b() Unknown
nvoglv64.dll!000000005d376089() Unknown
nvoglv64.dll!000000005d22574c() Unknown
nvoglv64.dll!000000005d33c50f() Unknown
nvoglv64.dll!000000005d303f76() Unknown
nvoglv64.dll!000000005d360ad7() Unknown
nvoglv64.dll!000000005d56b806() Unknown
nvoglv64.dll!000000005d569f27() Unknown
nvoglv64.dll!000000005d573412() Unknown
nvoglv64.dll!000000005d556ef9() Unknown
Game.exe!XGfx::VideoBaseClass::BE_ThreadedRenderFinish() Line 273

Currently we do get these validation errors, but we get quite a few of them every frame, so I dare say it’s unlikely they’re the cause.

Vulkan ERROR[DS]: ‘Descriptor set 0x6241 encountered the following validation error at vkCmdDraw() time: Descriptor in binding #22 at global descriptor index 21 requires an image view of type VK_IMAGE_VIEW_TYPE_CUBE but got VK_IMAGE_VIEW_TYPE_2D.’ flags = 0x8 objectType = 23 object = 0x0000000000006241 location = 3081 messageCode = 59 pUserData = 00007FF6841D5990
Vulkan ERROR [SC]: ‘VS consumes input at location 7 but not provided’ flags=0x8 objectType=3 object=0x(nil) location=1719 messageCode=3 pUserData=0x3053b98
Vulkan WARN [SC]: ‘FS writes to output location 1 with no matching attachment’ flags=0x2 objectType=0 object=0x(nil) location=1775 messageCode=2 pUserData=0x32cd8b8
Vulkan PERF [SC]: object: 0 type: 0 location: 1748 msgCode: 2: Vertex attribute at location 2 not consumed by VS

Well, we’ve found two artificial ways to trigger this crash:

  1. DescriptorSet referencing a VkImageView of a VkImage that was deleted
  2. IndexBuffer with an index going way out of the VertexBuffer range
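
For what it’s worth, the second trigger can be caught on the CPU before the draw ever reaches the driver. The following is only a sketch of such a debug check: the function name is made up, it assumes 32-bit indices with primitive restart disabled, and it assumes the caller knows how many vertices the bound vertex buffer holds.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical debug helper (not part of Vulkan or of the engine discussed here).
// Returns the position of the first index that would fetch a vertex outside
// [0, vertexCount), or std::nullopt if every index is in range.
// vertexOffset mirrors the vertexOffset parameter of vkCmdDrawIndexed, which is
// added to each index before the vertex is fetched.
// Assumption: primitive restart is disabled, so 0xFFFFFFFF is treated as a
// normal (and almost certainly out-of-range) index.
std::optional<std::size_t> FindOutOfRangeIndex(const std::vector<std::uint32_t>& indices,
                                               std::int32_t vertexOffset,
                                               std::size_t vertexCount)
{
    for (std::size_t i = 0; i < indices.size(); ++i) {
        // Widen to 64 bits so that index + offset cannot overflow.
        const std::int64_t vertex = static_cast<std::int64_t>(indices[i]) + vertexOffset;
        if (vertex < 0 || static_cast<std::uint64_t>(vertex) >= vertexCount) {
            return i;
        }
    }
    return std::nullopt;
}
```

Running a check like this over every index buffer at upload time (in debug builds only) would at least rule this class of invalid access in or out.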

So my understanding is that this can happen on any invalid GPU memory access, and that it is essentially a “device-lost” event that happens to be terminal instead of returning the error?
Can anyone from NVIDIA at least confirm this assessment?
There were some artificial situations where we did get an actual DEVICE_LOST error, so I would really appreciate some clarification about the expected behaviour for any such issues.

I’m not from NVIDIA, but the expected behavior is pretty clear: if you violate the spec in any way, shape or form, you get undefined behavior. That is, the driver is allowed to do anything, including crashing your app randomly. So your app crashing falls into expected behavior. Returning DEVICE_LOST is also valid, but not required. So is returning a garbage value, rendering crap, etc.

Your first case seems to be a violation if there are command buffers in flight that reference the deleted VkImage, but that is the least conservative reading of the spec. Your second case is a violation only if you didn’t enable the robustBufferAccess feature. The index must still be less than or equal to the maxDrawIndexedIndexValue limit, though.
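
One common way to avoid the first case is to delay destruction until no command buffer in flight can still reference the object. Below is a minimal sketch of that idea, not anybody’s actual engine code: the class, its method names and the frame-counter scheme are all assumptions, and the destroy callbacks stand in for calls such as vkDestroyImageView and vkDestroyImage.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: keeps destroy callbacks pending until the GPU has
// finished the last frame that used the resource.
class DeferredDestroyQueue {
public:
    // lastUsedFrame: index of the last frame whose command buffers (or
    // descriptor sets) reference the resource.
    // destroy: callback that actually frees it, e.g. wrapping vkDestroyImageView.
    void Retire(std::uint64_t lastUsedFrame, std::function<void()> destroy) {
        pending_.push_back(Entry{lastUsedFrame, std::move(destroy)});
    }

    // completedFrame: newest frame index known to have finished on the GPU,
    // for example because the fence passed to that frame's vkQueueSubmit
    // has signaled. Runs the callbacks in the order they were retired and
    // returns how many resources were destroyed.
    std::size_t Collect(std::uint64_t completedFrame) {
        std::size_t destroyed = 0;
        std::vector<Entry> stillPending;
        for (Entry& entry : pending_) {
            if (entry.lastUsedFrame <= completedFrame) {
                entry.destroy();
                ++destroyed;
            } else {
                stillPending.push_back(std::move(entry));
            }
        }
        pending_ = std::move(stillPending);
        return destroyed;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t lastUsedFrame;
        std::function<void()> destroy;
    };
    std::vector<Entry> pending_;
};
```

Because callbacks run in retirement order, retiring the image view before its image keeps the destruction order valid. Note that this only helps if descriptor sets still pointing at the view are also updated or retired before they are bound again.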

DEVICE_LOST can actually happen at any time even if your program is 100% correct, most likely because another program crashed the driver/GPU. If another program crashes the GPU, the Vulkan driver in your app MUST NOT crash, but it may return DEVICE_LOST on any call to tell you that the GPU is gone (device memory was reset, etc.). You would also get DEVICE_LOST if you hung the GPU and the OS reset the driver (after 2 s for a single shader execution under standard Windows).
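
Since DEVICE_LOST can come back from a perfectly correct program, it is worth checking the result of every submit rather than assuming success. Here is a rough sketch of that: so that it compiles without the Vulkan headers it uses stand-in constants whose values match VK_SUCCESS and VK_ERROR_DEVICE_LOST, and SubmitOrRecover plus the recreateDevice callback are hypothetical names. It obviously cannot help when the driver crashes inside the call instead of returning.

```cpp
#include <cstdint>
#include <utility>

// Stand-ins so the sketch builds without <vulkan/vulkan.h>. Real code would
// use VkResult, VK_SUCCESS and VK_ERROR_DEVICE_LOST directly.
using ResultCode = std::int32_t;
constexpr ResultCode kSuccess = 0;      // same value as VK_SUCCESS
constexpr ResultCode kDeviceLost = -4;  // same value as VK_ERROR_DEVICE_LOST

enum class SubmitOutcome { Ok, DeviceLost, OtherError };

// Maps a raw result code onto the three cases the caller cares about.
inline SubmitOutcome ClassifySubmitResult(ResultCode result) {
    if (result == kSuccess) return SubmitOutcome::Ok;
    if (result == kDeviceLost) return SubmitOutcome::DeviceLost;
    return SubmitOutcome::OtherError;
}

// submit: any callable that performs the submit and returns its result code,
//         e.g. a lambda wrapping vkQueueSubmit.
// recreateDevice: hypothetical callback that destroys everything created from
//         the lost device and builds a new device and its resources.
// Returns true only if the submit succeeded.
template <typename SubmitFn, typename RecreateFn>
bool SubmitOrRecover(SubmitFn&& submit, RecreateFn&& recreateDevice) {
    switch (ClassifySubmitResult(std::forward<SubmitFn>(submit)())) {
        case SubmitOutcome::Ok:
            return true;
        case SubmitOutcome::DeviceLost:
            // The work of this frame is gone; rebuild and let the caller retry.
            std::forward<RecreateFn>(recreateDevice)();
            return false;
        case SubmitOutcome::OtherError:
            return false;
    }
    return false;
}
```

The caller would skip the rest of the frame when this returns false and start over with the next one.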