NVIDIA Vulkan driver crashes with code c0000409 on GeForce MX150 under Windows 10 when executing a command buffer

The NVIDIA Vulkan driver crashes with code c0000409 (STATUS_STACK_BUFFER_OVERRUN) on a GeForce MX150 at nvoglv64!vk_icdNegotiateLoaderICDInterfaceVersion+0xc5019 under Windows 10 when executing a command buffer.

With Intel iGPUs and AMD GPUs it works without any problems.

Test program: outnvidia.zip (6.3 MB) (run bin\gltftest.exe; the ZIP also includes all the shader source code. The crash happens at the multiview-capable skybox drawing, so the skybox*.* shader files are the interesting part for debugging. The “/fakedvr” command-line parameter enables the fake-VR mode, and “/openvr” enables the real OpenVR mode, to test it with multiview in real usage.)

WinDbg Crash Log: nvidia_crash.txt (20.4 KB)

And you can find the full source code at GitHub - BeRo1985/pasvulkan at c4907be64fbb5957421fc12da75539857378869a (where pasvulkan/projects/gltftest/src/ is the sub-project location of gltftest.exe).

Driver version: 472.12 64-bit

Screen capture video of this issue on YouTube: - YouTube

With driver version 496.13 it is still nearly the same, but now with DEVICE_LOST instead of the exception code. I’ve updated the ZIP with a new build of the example; it now also includes a capture.rdc for RenderDoc, and RenderDoc also has problems replaying this capture on NVIDIA GPUs (at least on an MX150).

ZIP: outnvidia.zip (6.8 MB) (matches PasVulkan GIT repo tree version GitHub - BeRo1985/pasvulkan at b1bc21ba428653b22cec685c23a60c6f5e1934c3 )

Hello @rosseaux and welcome to the NVIDIA Developer forums!

Thank you for bringing this issue up with us. I forwarded all your information to our Vulkan experts and I will post here if I receive updates or further questions for you.

Markus

Thanks. Here is an updated screen capture video together with a new ZIP: NVIDIA Vulkan problems - YouTube and https://rootserver.rosseaux.net/stuff/nvidiac0000409/outnvidia.zip. I have also fixed all errors and warnings reported by the validation layers, but without any success regarding the problem on NVIDIA GPUs.

Update:

It “seems to be” an Optimus issue: my PasVulkan glTF code runs without problems on my GTX 1060 6GB, but not on the GeForce MX150 2GB in my ThinkPad T480 (which is also Pascal, like the GTX 1060).

But the computer with the GTX 1060 has older drivers (456.71). I’ll update those later and then recheck again.

OK, with newer drivers it also crashes on my GTX 1060, but it works with the older version 456.71.

@MarkusHoHo I have done the more detailed driver-version bisecting. The result: 466.77 seems to be the last good NVIDIA driver version where my Vulkan code works without crashing; from about 471.xx on, it crashes. And with 466.77 it even works on the GeForce MX150 in my ThinkPad T480 (besides on the GTX 1060 in one of my desktop computers).

Edit: On further research, it sometimes happens with 466.77 as well, with a few glTF models, but it looks like it may have something to do with VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT in connection with persistent mapping. When I disable persistent mapping for the constantly updated streaming buffers (by mapping and unmapping at each streaming-buffer update instead), the crash never happens with 466.77, but it still always happens with newer NVIDIA driver versions.
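To make the two update strategies concrete, here is a minimal sketch (not the project’s actual code; the device, memory handle, offsets and sizes are placeholders, and the memory is assumed to be HOST_VISIBLE | HOST_COHERENT, so no explicit flush is needed):

```c
#include <string.h>
#include <vulkan/vulkan.h>

/* Workaround variant: map and unmap around every streaming-buffer update. */
static void updateStreamingBufferMapUnmap(VkDevice device, VkDeviceMemory memory,
                                          VkDeviceSize offset, VkDeviceSize size,
                                          const void *data) {
  void *mapped = NULL;
  if (vkMapMemory(device, memory, offset, size, 0, &mapped) == VK_SUCCESS) {
    memcpy(mapped, data, (size_t)size);
    vkUnmapMemory(device, memory);
  }
}

/* Persistent-mapping variant: the memory was mapped once at creation and
   stays mapped; each frame just writes through the persistent pointer. */
static void updateStreamingBufferPersistent(void *persistentlyMappedBase,
                                            VkDeviceSize offset, VkDeviceSize size,
                                            const void *data) {
  memcpy((char *)persistentlyMappedBase + offset, data, (size_t)size);
}
```

Both variants are valid Vulkan for coherent memory, which is why the crash occurring only with the persistent variant points at the driver rather than the application.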

Edit 2: Hm, okay, after a fresh reboot, driver version 466.77 also works with persistent mapping enabled again. But it still crashes consistently on NVIDIA driver versions newer than 466.77.

Edit 3: Another NVIDIA Vulkan driver issue (466.77): NVIDIA seems to have a coherent-buffer synchronization problem when the main loop, and thus the continuous display of new frames, is paused for a short time; afterwards the buffers briefly contain partially wrong, older data from a few frames earlier. => see Another NVIDIA Vulkan driver issue - YouTube
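For context, this is the explicit flush Vulkan would require if the memory were not HOST_COHERENT (a sketch with placeholder handles, not the project’s code). With HOST_COHERENT memory the driver must make host writes visible without this call, which is why briefly seeing stale, frames-old buffer contents suggests a driver-side coherency problem rather than a missing flush in the application:

```c
#include <vulkan/vulkan.h>

/* Explicit flush of host writes, only required for non-HOST_COHERENT memory.
   offset must be a multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize
   (or the range must extend to the end of the allocation via VK_WHOLE_SIZE). */
static void flushHostWrites(VkDevice device, VkDeviceMemory memory,
                            VkDeviceSize offset, VkDeviceSize size) {
  VkMappedMemoryRange range = {
      .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
      .pNext = NULL,
      .memory = memory,
      .offset = offset,
      .size = size, /* or VK_WHOLE_SIZE */
  };
  vkFlushMappedMemoryRanges(device, 1, &range);
}
```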

Thank you once again for keeping at it and updating with more details!

Right now I don’t have anything to share yet.

My NVIDIA graphics driver crash problem is now even weirder. I have now implemented NVIDIA Aftermath. Without Aftermath it crashes, but with Aftermath loaded and enabled it runs without any crash, just as it should.
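For context, a Vulkan Aftermath integration typically enables the VK_NV_device_diagnostics_config extension at device creation; the extension adds driver-side instrumentation, which might be one reason the crash disappears once Aftermath is active. A minimal sketch of that device-creation change (illustrative, not the project’s code):

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

/* Create a device with NVIDIA's diagnostics config enabled, roughly as an
   Aftermath integration would. queueInfo comes from the application. */
static VkResult createDeviceWithDiagnostics(VkPhysicalDevice physicalDevice,
                                            const VkDeviceQueueCreateInfo *queueInfo,
                                            VkDevice *outDevice) {
  const char *extensions[] = {VK_NV_DEVICE_DIAGNOSTICS_CONFIG_EXTENSION_NAME};

  VkDeviceDiagnosticsConfigCreateInfoNV diagnostics = {
      .sType = VK_STRUCTURE_TYPE_DEVICE_DIAGNOSTICS_CONFIG_CREATE_INFO_NV,
      .pNext = NULL,
      .flags = VK_DEVICE_DIAGNOSTICS_CONFIG_ENABLE_SHADER_DEBUG_INFO_BIT_NV |
               VK_DEVICE_DIAGNOSTICS_CONFIG_ENABLE_RESOURCE_TRACKING_BIT_NV |
               VK_DEVICE_DIAGNOSTICS_CONFIG_ENABLE_AUTOMATIC_CHECKPOINTS_BIT_NV,
  };

  VkDeviceCreateInfo createInfo = {
      .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
      .pNext = &diagnostics, /* chains the diagnostics config in */
      .queueCreateInfoCount = 1,
      .pQueueCreateInfos = queueInfo,
      .enabledExtensionCount = 1,
      .ppEnabledExtensionNames = extensions,
  };
  return vkCreateDevice(physicalDevice, &createInfo, NULL, outDevice);
}
```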

I’ve also updated the ZIP at https://rootserver.rosseaux.net/stuff/nvidiac0000409/outnvidia.zip which now includes the following batch start files:

  • start_with_aftermath.bat to start with Aftermath
  • start_without_aftermath.bat to start without Aftermath
  • start_openvr_with_aftermath.bat to start with Aftermath and OpenVR
  • start_openvr_without_aftermath.bat to start without Aftermath but with OpenVR
  • start_fakedvr_with_aftermath.bat to start with Aftermath and faked VR mode
  • start_fakedvr_without_aftermath.bat to start without Aftermath but with faked VR mode

Otherwise, I would like to ask what the current status is and how I can help locate the exact cause of the problem. My feeling is that it has to do with either the descriptor sets or with memory management, in each case at the driver level.

In the long run, though, it should also run without any crashes on NVIDIA GPUs, just as it does with Aftermath enabled.

Edit: And sometimes, when Aftermath is enabled, I also get the following, possibly LLVM-related, output on the console:

invalid vector, expected one element of type subrange
!998 = !DICompositeType(tag: DW_TAG_array_type, baseType: !999, size: 4, align: 32, flags: DIFlagVector, elements: !2)
invalid vector, expected one element of type subrange
!1082 = !DICompositeType(tag: DW_TAG_array_type, baseType: !999, size: 3, align: 32, flags: DIFlagVector, elements: !2)
invalid vector, expected one element of type subrange
!39 = !DICompositeType(tag: DW_TAG_array_type, baseType: !40, size: 4, align: 32, flags: DIFlagVector, elements: !2)
invalid vector, expected one element of type subrange
!39 = !DICompositeType(tag: DW_TAG_array_type, baseType: !40, size: 4, align: 32, flags: DIFlagVector, elements: !2)

I’ve again updated the ZIP at https://rootserver.rosseaux.net/stuff/nvidiac0000409/outnvidia.zip where I’ve added vkCmdSetCheckpointNV calls, reordered the descriptor set indices (so that the vkCmdBindDescriptorSets calls are now almost always in ascending descriptor-set-index order) and optimized away unnecessary vkCmdBindDescriptorSets calls, in the hope that this was the cause, but unfortunately it was not.
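For reference, here is a minimal sketch of how such checkpoints from VK_NV_device_diagnostic_checkpoints can be recorded and then read back after a device loss (illustrative code, not the project’s; marker strings and the fixed-size array are placeholders):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* The extension entry points must be fetched at runtime. */
static PFN_vkCmdSetCheckpointNV pfnCmdSetCheckpointNV;
static PFN_vkGetQueueCheckpointDataNV pfnGetQueueCheckpointDataNV;

static void loadCheckpointFunctions(VkDevice device) {
  pfnCmdSetCheckpointNV = (PFN_vkCmdSetCheckpointNV)
      vkGetDeviceProcAddr(device, "vkCmdSetCheckpointNV");
  pfnGetQueueCheckpointDataNV = (PFN_vkGetQueueCheckpointDataNV)
      vkGetDeviceProcAddr(device, "vkGetQueueCheckpointDataNV");
}

/* Record a marker between draws; the pointer is handed back verbatim later,
   so string literals with static lifetime work well. */
static void markCheckpoint(VkCommandBuffer cmd, const char *label) {
  pfnCmdSetCheckpointNV(cmd, label);
}

/* After VK_ERROR_DEVICE_LOST: query which checkpoints the GPU reached. */
static void dumpCheckpoints(VkQueue queue) {
  uint32_t count = 0;
  pfnGetQueueCheckpointDataNV(queue, &count, NULL);
  if (count > 64) count = 64; /* placeholder fixed-size buffer */
  VkCheckpointDataNV data[64];
  for (uint32_t i = 0; i < count; i++) {
    data[i].sType = VK_STRUCTURE_TYPE_CHECKPOINT_DATA_NV;
    data[i].pNext = NULL;
  }
  pfnGetQueueCheckpointDataNV(queue, &count, data);
  for (uint32_t i = 0; i < count; i++) {
    printf("stage 0x%x reached checkpoint: %s\n",
           (unsigned)data[i].stage, (const char *)data[i].pCheckpointMarker);
  }
}
```

The last checkpoint reached before the device loss narrows the fault down to the draws between two markers.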

At least, if I comment out either the skybox rendering or the mesh rendering portion of the forward render pass, it no longer crashes with a device-lost error. The same is true when Aftermath is enabled. But otherwise, it crashes with device-lost errors on NVIDIA GPUs with current drivers.

Furthermore, all the Vulkan SDK validation layers, including the GPU-assisted and synchronization validation, find absolutely nothing; they report no warnings or errors at all.

But anyway, it still continues to crash with device-lost errors on NVIDIA GPUs with current drivers.

And these problems do not exist on Intel iGPUs or AMD GPUs; there it works completely without any problems.

I therefore ask for help or advice on how to isolate this problem, for example to be able to create a minimal test case from it. So far, whenever I try to reproduce this as a minimal test case, it runs flawlessly, but not when the same code is part of the large, complete program. I therefore suspect in the meantime that it is probably some kind of chain-reaction bug at the driver level.

Edit: As additional information, I own the following Vulkan-capable GPUs and always test with all of them: NVIDIA GeForce MX150 2GB (in my ThinkPad T480), NVIDIA GeForce GTX 960 2GB (in a desktop computer), NVIDIA GeForce GTX 1050 4GB (in a Lenovo Thunderbolt 3 Graphics Dock), NVIDIA GeForce GTX 1060 6GB (in a desktop computer), AMD Radeon HD 7850 2GB (in a desktop computer), AMD Radeon R9 270 2GB (in a desktop computer), AMD Radeon RX 580 8GB (in a desktop computer), Intel UHD Graphics 615 (in an Intel Core m3-7Y30 in my GPD Pocket 2), Intel UHD Graphics 620 (in an Intel Core i5-8250U in my ThinkPad T480), and various Android devices (including an NVIDIA Shield TV and an Exynos-based Samsung Galaxy S21).

I’ve created a C++ replay with NVIDIA Nsight, which also crashes with DEVICE_LOST for me (when Aftermath is not enabled):

C++ project download link: https://rootserver.rosseaux.net/stuff/nvidiac0000409/gltftest__2022_02_03__02_20_31.zip

NVIDIA Nsight Crash Minidump: https://rootserver.rosseaux.net/stuff/nvidiac0000409/eb7117ce-ed11-45d8-a873-8c7d4404d621.dmp

Hello once more!

I really admire your persistence with this issue, thank you very much!

Our internal tracker is also progressing and engineering is working on it as time permits. I cannot disclose much of our internal work, but I can tell you that engineering does recognize there to be a problem and has some leads on how to address it.

I will share your findings with engineering; I think the Nsight replay will be especially useful. Let’s hope this will lead to a good solution soon.

Thanks again!

I’m glad to hear that. And many thanks so far.

If there is still anything somehow later I can do to help you guys address this issue, please let me know.

We have to thank you. If I have more questions I will reach out again.

Any update? At least I’ve tested the current drivers some time ago. The result:

With GT(X) 10xx GPUs it still crashes, no longer every time, but still very often with many glTF models and in certain (validation-clean) Vulkan buffer allocation situations.

But on RTX 30xx it runs completely flawlessly, without any exceptions so far.

Therefore, please look at this again more closely. I know I unfortunately have a very low priority for you as an indie dev, but I’m starting to depend on this issue being completely fixed, even for older GT(X) 9xx/10xx GPUs, since I’ll be making my financial living off of my Vulkan-based indie software product. I mean, one can’t force users to buy newer GPUs just because of a bug in a driver. :-)

Hi @rosseaux ! Apologies for the long silence.

Engineering found a probable root cause related specifically to Pascal-generation hardware and has a fix. According to my information it should be in the latest drivers, so I would encourage you to update and try again.

Please let me know if this works or not.