Driver crash in vkQueuePresentKHR upon unplugging external HDMI display on Windows 10

Hi,

I’ve experienced an NVIDIA driver crash on several Windows-10-powered PCs in an SDL2/Vulkan-based game that my company is developing and have been able to reproduce it almost every time under the following conditions:

  • Plug in an additional HDMI display device before starting your application, and set the display behaviour to Duplicate to have both of your display devices display the same application.
  • Start any Vulkan-based application in fullscreen-exclusive mode (i.e. one that creates its window with the WS_POPUP | WS_CLIPSIBLINGS | WS_CLIPCHILDREN style, and calls ChangeDisplaySettings or ChangeDisplaySettingsEx with the CDS_FULLSCREEN parameter – in our tests, setting the new resolution to something different from your desktop resolution maximizes the odds of reproducing the issue). The issue cannot be reproduced if you start the application in windowed mode or borderless windowed mode. To avoid writing a Vulkan application from scratch just for testing, you may consider using one of the Vulkan examples kindly provided by Sascha Willems at https://github.com/SaschaWillems/Vulkan: passing the --fullscreen command-line argument starts the application in fullscreen-exclusive mode, and the -width and -height command-line arguments specify the width and height of the window. The gears example reproduced the issue reliably for us.
  • Unplug the external display device you plugged in earlier.
  • At this point, depending on your hardware and configuration, different things can happen:

  • On my ASUS gaming laptop with a GTX 980M GPU and a built-in 1920×1080 G-SYNC display, the application becomes very choppy for about 2 seconds, then freezes and triggers a driver crash (not a BSOD, just a black screen followed by a heavily corrupted image on the remaining display device). Windows 10 then restarts the driver, although sometimes it simply reboots instead. This behavior did not depend on the NVIDIA driver version (the latest test was performed on version 441.87, if that is relevant). Once the driver has restarted, most Vulkan functions called by the application return VK_ERROR_DEVICE_LOST, and attempts to recover properly always fail: tearing down and then recreating the device causes the Vulkan validation layers to display the following:
    terminator_CreateDevice: Failed in ICD C:\WINDOWS\System32\DriverStore\FileRepository\nvami.inf_amd64_039a3b72bf87b399\.\nvoglv64.dll vkCreateDevicecall
    vkCreateDevice:  Failed to create device chain.
    
  • I've also tried to completely tear down the Vulkan context (including the VkInstance) and then create an OpenGL context when this occurred, and this caused our application to crash upon calling any glXXX function.
  • On my colleague's desktop PC with a GTX 650 Ti (driver version: 432.00) and a (rather old, non-G-SYNC) 1360×768 screen, neither the application nor the driver crashes, but the remaining display flickers, alternating between a completely black screen and the expected output of the application. If your application can toggle between fullscreen-exclusive mode and windowed mode with a shortcut such as Alt+Tab, then using this shortcut may put an end to the flickering and restore the application to a "normal" state.
  • Our internal logs showed that the crash/unexpected behavior occurs in the vkQueuePresentKHR function (time logs indeed showed that 2-10 seconds are spent inside this function when the issue occurs), but the latest Vulkan SDK’s validation layers (version 1.1.30) do not output any error message, apart from the vkCreateDevice error message above.
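
A minimal sketch of how the time spent inside a call such as vkQueuePresentKHR could be measured. The helper name timed_call, the CallTiming struct and the threshold are hypothetical and not taken from the application described here; in a real renderer the lambda would wrap the actual vkQueuePresentKHR call.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical result type: how long the wrapped call blocked, and whether
// that exceeded the caller-supplied threshold.
struct CallTiming {
    std::chrono::milliseconds elapsed;
    bool exceeded;
};

// Hypothetical helper (not part of Vulkan or SDL2): runs `call`, measures the
// wall-clock time it blocks using a monotonic clock, and logs a line to
// stderr when the threshold is exceeded.
template <typename Fn>
CallTiming timed_call(const char* name, Fn&& call,
                      std::chrono::milliseconds threshold) {
    assert(name != nullptr);
    const auto start = std::chrono::steady_clock::now();
    call();  // e.g. [&] { result = vkQueuePresentKHR(queue, &presentInfo); }
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    const bool exceeded = elapsed > threshold;
    if (exceeded) {
        std::fprintf(stderr, "%s blocked for %lld ms\n", name,
                     static_cast<long long>(elapsed.count()));
    }
    return CallTiming{elapsed, exceeded};
}
```

With a threshold of, say, one second, a present call that stalls for the 2-10 seconds mentioned above would be flagged, while normal frames would not produce any log output.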

    We tried to reproduce this issue with an OpenGL context instead of a Vulkan context, and all of our attempts have failed: unplugging the external display device does not cause this issue.
    We also tried to reproduce it with a Surface Pro 4 tablet PC (which only has an Intel iGPU), both with Vulkan and OpenGL, but the issue didn’t show up in this case either.

    All of the above leads me to believe this is an NVIDIA driver bug that is specific to Vulkan (maybe a presentation engine bug?).

    Also, not sure how relevant this is, but our tests were performed using a Samsung 4K TV as the external display device, and the present mode we used was VK_PRESENT_MODE_FIFO_KHR.

    Looking forward to your replies!

    Best,
    Alex

    Still getting this issue as of today.

    I no longer get a black screen on my gaming laptop as I used to, but I now get a BSoD with error PAGE_FAULT_IN_NONPAGED_AREA located in file nvlddmkm.sys (Windows 10 updates have probably changed the way such issues are handled by the OS, which, I believe, is the main reason why a BSoD is now triggered by this driver issue).

    For some reason, this crash does not occur when my application is run under the Visual Studio debugger (2015, if that is relevant): the application behaves correctly, vkQueuePresentKHR returns VK_ERROR_OUT_OF_DATE_KHR, and swap chain recreation works fine. It does occur in all other conditions (running under the LLDB debugger, or running without any debugger). I guess the Visual Studio debugger uses some kind of fault-tolerant heap here?
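
For readers unfamiliar with the expected (non-crashing) path, here is a minimal sketch of how the result of vkQueuePresentKHR is commonly mapped to a recovery step. The enum and function names (PresentResult, RecoveryAction, next_action) are hypothetical; the numeric values mirror the VkResult constants from vulkan_core.h so that the sketch compiles without the Vulkan SDK, and a real application would include <vulkan/vulkan.h> and use VkResult directly. The recovery policy itself is an assumption, not something prescribed by the driver.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the VkResult codes relevant to presentation (values copied
// from vulkan_core.h; names are hypothetical).
enum PresentResult : std::int32_t {
    kSuccess     = 0,            // VK_SUCCESS
    kSuboptimal  = 1000001003,   // VK_SUBOPTIMAL_KHR
    kOutOfDate   = -1000001004,  // VK_ERROR_OUT_OF_DATE_KHR
    kSurfaceLost = -1000000000,  // VK_ERROR_SURFACE_LOST_KHR
    kDeviceLost  = -4            // VK_ERROR_DEVICE_LOST
};

// Hypothetical recovery steps an application might take.
enum class RecoveryAction {
    Continue,           // keep rendering with the current swap chain
    RecreateSwapchain,  // the path observed under the Visual Studio debugger
    RecreateSurface,    // surface is gone: rebuild surface and swap chain
    RecreateDevice,     // device lost: rebuild the logical device
    Abort               // unknown error: give up
};

// Maps a present result to the next recovery step (assumed policy).
RecoveryAction next_action(std::int32_t result) {
    switch (result) {
        case kSuccess:     return RecoveryAction::Continue;
        case kSuboptimal:  // presentable, but the swap chain no longer matches
        case kOutOfDate:   return RecoveryAction::RecreateSwapchain;
        case kSurfaceLost: return RecoveryAction::RecreateSurface;
        case kDeviceLost:  return RecoveryAction::RecreateDevice;
        default:           return RecoveryAction::Abort;
    }
}
```

In the failing case described in this thread, the application never gets as far as the RecreateSwapchain branch: the call stalls inside the driver, and later device recreation fails.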

    I’ve tried a lot of workarounds (like calling vkDeviceWaitIdle every frame to prevent potential synchronization issues, or using functions from the VK_EXT_full_screen_exclusive device extension), but none of them has changed anything.

    NVIDIA, please consider taking a look at this issue. This driver crash still occurs with Sascha Willems’ examples, which are widely regarded as a near-authoritative reference on how to use Vulkan. At the very least, please provide a workaround to avoid hitting the buggy code path in the driver.

    EDIT: I just tested some free and open-source Vulkan engines/games (vkQuake, for instance) and they are all affected by this issue :/

    The Vulkan driver team is looking into this issue. Some additional system information would help us match your setup.

    1. Can you share your dxdiag?
    2. Can you share your Windows crash dump (minidump) from the BSOD?

    Hi Wen_Su,

    Thank you for investigating this issue.
    Please find the logs you requested in this Dropbox folder.

    This folder contains:

    • a dxdiag file named DxDiag-With-External-Monitor.txt, generated approx. 1 hour ago right after plugging in my external HDMI monitor (the Samsung 4K TV used in my repro cases);
    • the BSoD minidump file (041420-26703-01.dmp), generated from a repro case triggered after generating the dxdiag file;
    • a second dxdiag file named DxDiag-After-Crash.txt, generated after rebooting my computer following the BSoD (the external monitor was no longer plugged in when this file was generated).

    Feel free to request additional logs if necessary.

    Best,
    Alex

    utalex,

    The crash dump is helpful. We were able to pinpoint the crash location. Thank you for providing the details. The driver team is working on it now. I hope to bring an update once the bug is resolved.

    That is great news, thank you Wen_Su!

    Looking forward to testing this update!