[windows] Possible vulkan driver bug with recent driver versions likely related to swapchain

kd-11 · January 7, 2019, 8:23pm

I am experiencing a curious driver issue. When switching a Vulkan application to/from fullscreen, a hang in the NV Vulkan driver is observed, with the callstack pointing to win32u.dll. This isn’t surprising in itself since the window is expected to change size, but what is curious/annoying is that whatever the routine is polling does not time out, and it causes every event on the device to hang as well. This includes waiting for fences etc.; polling just returns VK_NOT_READY forever. At the moment of this deadlock, there has not yet been a chance to call vkAcquireNextImageKHR to observe a VK_ERROR_OUT_OF_DATE_KHR result and deal with it, which can easily leave the application in a hung state. What is frustrating is that once this happens, there is no way out of it. I even installed a wndproc hook and I can clearly see the message pump is running smoothly, so I have no idea what the driver could be waiting for.

Another curiosity is that this issue only happens if you present on the primary display. On multi-monitor setups, launching on the secondary display does not freeze like this. Of course, running the renderer off a secondary card works just as well and also never hangs.

There is a bug nonetheless, since calling vkGetSwapchainImagesKHR on the swapchain whilst in this state turns the process into an unkillable zombie. It cannot finish termination via Task Manager, TerminateProcess or taskkill /f; I get an access denied error, likely because Windows is forever waiting on the driver to finish something. I can also provide code/builds that will take down the entire OS including keyboard HID drivers (LED switching stops working), just because present was called on an out-of-date surface. I can provide links to code and builds with these problems if needed.
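For anyone trying to at least detect the "fence never signals" state described above, here is a minimal sketch of a bounded poll. It does nothing about the driver hang itself; it only lets the application notice that a fence has stayed unsignaled past a deadline instead of spinning forever. The helper name `wait_with_deadline` and its parameters are hypothetical, and the predicate is deliberately generic so the sketch builds without the Vulkan SDK. In a real renderer the predicate would presumably wrap `vkGetFenceStatus(device, fence) == VK_SUCCESS`.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not from the original post): polls `is_signaled`
// until it returns true or `timeout` elapses.
// Returns true if the condition was observed, false if the deadline passed.
bool wait_with_deadline(const std::function<bool()>& is_signaled,
                        std::chrono::milliseconds timeout,
                        std::chrono::milliseconds poll_interval =
                            std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (is_signaled()) {
            return true;   // fence (or whatever is being polled) is ready
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // still not ready: treat as a suspected hang
        }
        std::this_thread::sleep_for(poll_interval);
    }
}

// Assumed usage with Vulkan (device and fence come from the application):
//   bool ok = wait_with_deadline(
//       [&] { return vkGetFenceStatus(device, fence) == VK_SUCCESS; },
//       std::chrono::milliseconds(2000));
//   if (!ok) { /* log it and stop issuing further swapchain calls */ }
```

On a timeout the safest reaction is probably just to log and stop touching the swapchain, given that further swapchain calls in this state are what turn the process into a zombie.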

External issue: Vulkan: Running games on Fullscreen causes a crash after a while · Issue #5351 · RPCS3/rpcs3 · GitHub

kd-11 · January 31, 2019, 8:57pm

Looks like switching from MAILBOX to FIFO mode fixes this problem, or at least makes it nearly impossible to trigger. Although IMMEDIATE mode is hidden, forcing it as the present mode makes the problem appear almost instantly when going to/from fullscreen. Fullscreen is achieved simply by using SetWindowPos + SetWindowLong to change the border flags.
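A rough sketch of that workaround as a present-mode picker: default to FIFO and only use MAILBOX when explicitly asked for and reported by the surface. `PresentMode` is a stand-in enum so the snippet compiles without the Vulkan SDK (its values mirror the `VK_PRESENT_MODE_*_KHR` constants), and `choose_present_mode` is a hypothetical helper, not something from RPCS3 or the driver. In real code the `supported` list would come from `vkGetPhysicalDeviceSurfacePresentModesKHR`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for VkPresentModeKHR (assumption: used only so this sketch is
// self-contained). Values match the Vulkan header constants.
enum class PresentMode : std::int32_t {
    Immediate   = 0,  // VK_PRESENT_MODE_IMMEDIATE_KHR
    Mailbox     = 1,  // VK_PRESENT_MODE_MAILBOX_KHR
    Fifo        = 2,  // VK_PRESENT_MODE_FIFO_KHR
    FifoRelaxed = 3,  // VK_PRESENT_MODE_FIFO_RELAXED_KHR
};

// Hypothetical helper: returns FIFO unless the caller opts in to MAILBOX
// and the surface actually reports it. IMMEDIATE is never selected, since
// it triggered the hang fastest in the report above.
PresentMode choose_present_mode(const std::vector<PresentMode>& supported,
                                bool allow_mailbox) {
    const bool has_mailbox =
        std::find(supported.begin(), supported.end(), PresentMode::Mailbox) !=
        supported.end();
    if (allow_mailbox && has_mailbox) {
        return PresentMode::Mailbox;
    }
    // The Vulkan spec requires FIFO to be supported, so it is a safe default.
    return PresentMode::Fifo;
}
```

With `allow_mailbox` left false this always yields FIFO, which matches the configuration that made the hang nearly impossible to trigger here.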

TheManuel1 · February 8, 2019, 3:39pm

nVidia Team:
Is this issue related to this other one that was fixed on Linux?

kd-11 · June 10, 2019, 12:15pm

I managed to isolate the issue here; I’ll try to explain it just in case someone else encounters this problem.
The last command buffer submission before a flip event contains the transfer onto the acquired swapchain image. The straightforward way is to wait on a semaphore from the acquire step and signal a semaphore that the present step then waits on. However, I did it differently, manually submitting frames to present independently. To synchronize, I was waiting for the submission fence to signal before calling vkQueuePresentKHR without a wait semaphore. This is the cause of the problem. Since completion of the command buffer is a signal operation, I believed it would be OK to then present immediately without waiting for another signal, but it seems the fence is signaled before the queued buffer is truly drained? The implementation was an attempt at application-side pacing, although admittedly it never worked correctly even in the best of times.
Either way, the biggest issue here is that this problem completely destroys Windows, requiring a power cycle to regain control. Sometimes even the mouse cursor does not work, although HDMI audio does continue outputting sound. It’s also not possible to kill the misbehaving application at all, even through Task Manager or taskkill, as it is stuck in an infinite loop within nvoglv64.dll. Attempting to do so throws an Access Denied error.
TL;DR: Don’t rely on the submission fence to order the present; use a semaphore signal instead if you encounter this issue.

wpierce · September 10, 2019, 5:56am

Hi kd-11,

We had tried to reproduce this issue before any mitigations were put into RPCS3, but were unable to.

I saw in the second-latest progress report that this issue has been completely worked around: https://rpcs3.net/blog/2019/07/31/progress-report-june-2019/#rsxframe . Let us know if any other issues related to this come up.
