[windows] Possible Vulkan driver bug with recent driver versions, likely related to the swapchain

I am experiencing a curious driver issue. When switching a Vulkan application to or from fullscreen, the NVIDIA Vulkan driver hangs, with the call stack pointing into win32u.dll. A wait at that point isn't surprising, since the window is expected to change size, but what is curious and annoying is that whatever the routine is polling never times out, and every other event on the device hangs with it: waiting on fences, for example, just returns VK_NOT_READY forever. At the moment of the deadlock, the application has not yet had a chance to call vkAcquireNextImageKHR, observe VK_ERROR_OUT_OF_DATE_KHR for the surface, and deal with it, so it can easily be left in a hung state. What is frustrating is that once this happens, there is no way out of it. I even installed a WndProc hook and can clearly see that the message pump is running smoothly, so I have no idea what the driver could be waiting for.

Another curiosity is that this issue only happens when presenting on the primary display. On multi-monitor setups, launching on the secondary display does not freeze like this. Running the renderer off a secondary card also works and never hangs.

There is a bug regardless, since calling vkGetSwapchainImagesKHR on the swapchain while in this state turns the process into an unkillable zombie. It cannot be terminated from Task Manager, via TerminateProcess, or with taskkill /f; each attempt fails with an access denied error, likely because Windows is waiting forever on the driver to finish something. I can provide links to code and builds that reproduce these problems if needed, including ones that take down the entire OS, keyboard HID drivers included (LED switching stops working), just because present was called on an out-of-date surface.

External issue: Vulkan: Running games on Fullscreen causes a crash after a while · Issue #5351 · RPCS3/rpcs3 · GitHub

It looks like switching the present mode from MAILBOX to FIFO fixes this problem, or at least makes it nearly impossible to trigger. While IMMEDIATE mode is hidden, forcing it as the present mode makes the problem appear almost instantly when going to or from fullscreen. Fullscreen is achieved simply by using SetWindowPos + SetWindowLong to change the border flags.

nVidia Team:
Is this issue related to this other one, that was fixed on Linux?

I managed to isolate the issue here; I’ll try to explain it just in case someone else encounters this problem.
The last command buffer submission before a flip event contains the transfer onto the acquired swapchain image. The straightforward way is to have that submission wait on a semaphore from the acquire step and signal a second semaphore that the present step waits on. However, I did it differently, manually submitting frames to present independently. To synchronize, I waited for the submission fence to signal before calling vkQueuePresentKHR without a wait semaphore. This is the cause of the problem. Since completion of the command buffer is a signal operation, I believed it would be OK to then present immediately without waiting for another signal, but it seems the fence is signaled before the queued buffer is truly drained? The implementation was an attempt at application-side pacing, although admittedly it never worked correctly even in the best of times.
Either way, the biggest issue here is that this problem completely takes down Windows, requiring a power cycle to regain control. Sometimes even the mouse cursor stops working, although HDMI audio continues to output sound. It is also not possible to kill the misbehaving application at all, even through Task Manager or taskkill, as it is stuck in an infinite loop within nvoglv64.dll. Attempting to do so throws an Access Denied error.
TL;DR - Don’t rely on the submission fence, just use a semaphore signal if you encounter this issue.

Hi kd-11,

We had tried to reproduce this issue before any mitigations were put into RPCS3, but were unable to.

I saw in the second-latest progress report that this issue has been completely worked around: https://rpcs3.net/blog/2019/07/31/progress-report-june-2019/#rsxframe . Let us know if any other issues related to this come up.