The possible root cause of the VK_KHR_present_wait (and vulkan related) freezes

vahook · April 1, 2025, 6:08pm

Hi!

After spending months reverse engineering various parts of the nvidia linux driver stack, I believe I have finally found the actual root cause of some of the vulkan related freezes (at least on X, using the current production driver 570.133.07). It might also explain some of the freezes under Wayland. If I’m not mistaken, the VK_KHR_present_wait freezes (NV bug 4924590) are still “under investigation”.
Furthermore, I’ve created a simple proof-of-concept fix that solved all of the freezing issues I had with vulkan applications (including DXVK). Although, I have to mention that I’m still rocking a relatively old mobile 1050 Ti, and so I haven’t been able to fully validate my findings on other hardware / software configurations.

The root cause:

(This section is mostly for nvidia engineers, as I won’t define / go into much detail about how the various components and subsystems in the driver work.)
Let’s suppose that we are in an environment where all of the relevant X extensions are available (DRI3, present, sync), and that we are using DRI3 pixmaps for the presentation with “traditional” wait fences. libGLX_nvidia.so orchestrates the pixmap presentation using the X connection that was supplied during the surface creation (e.g. the xcb connection when using vkCreateXcbSurfaceKHR or the Xlib Display object when using vkCreateXlibSurfaceKHR). This means that the xcb_sync_await_fence() and xcb_sync_reset_fence() requests used to reset the wait fence for reuse before executing xcb_present_pixmap() are also sent on that channel. However, libGLX_nvidia.so seemingly also creates a default / fallback X connection (an Xlib Display object) when the library is initialized. It uses it, for example, to communicate with the NV-GLX extension of the X driver (nvidia_drv.so).
And now, for the problematic part: when writing the damage event into the ringbuffers (located in the shared memory region of the “damage manager” component) for a given present wait fence, the client issues an XSync() on this default X connection, instead of the one that was used for the presentation. Since the X driver uses the X server’s input handling mechanisms for the damage manager (i.e. xf86AddGeneralHandler()), this can lead to the following race condition if the server gets overwhelmed with events:

1. The client queues the DRI3 pixmap presentation commands (including xcb_sync_await_fence() and xcb_sync_reset_fence()).
2. It writes (at least the first part of) the damage event, and then issues an XSync() on the default connection. (Since this connection isn’t used for other parts of the DRI3 presentation path, this won’t actually synchronize anything.)
3. The X server consumes just the XSync() on this unused connection and sends a reply. As a result, the client continues.
4. Eventually, the kernel driver does its thing, and it sends the damage event “pulse” to the X driver.
5. Due to the lack of synchronization between the socket used for the presentation and the event manager’s fd, the damage event might be processed before the presentation’s xcb_sync_reset_fence(), so the wait fence is triggered prematurely and then reset.
6. Hence xcb_present_pixmap() will never complete → the PresentIdleNotify event never arrives → the client will block forever in vkQueuePresentKHR().

My fix:

Located at: GitHub - vahook/nvglxfix
I wasn’t really sure whether that XSync() call in the damage event part was a remnant of a legacy presentation path, a typo, or part of some other mechanism I haven’t discovered yet. Therefore, I opted not to touch it. Instead, I inserted a dummy xcb_get_input_focus() request right after xcb_sync_reset_fence(), and waited for its reply just after the xcb_flush() call that happens at the end of the DRI3 pixmap presentation. (XSync() also operates this way, as its xcb equivalent is roughly free(xcb_get_input_focus_reply(c, xcb_get_input_focus(c), NULL)).)
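To make the ordering argument concrete, here is a toy model of the race and of the extra round trip. It is only a sketch and not driver code: the function names (present_completes, server_orders, hanging_orders) and the event labels are made up for illustration, the fence is assumed to start out triggered from the previous frame, and the fix is assumed to guarantee that the server has processed xcb_sync_reset_fence() before the client writes the damage event.

```python
from itertools import permutations

# Requests sent on the presentation connection. The X server processes
# requests from a single connection in order, so this order is fixed.
PRESENT_CONN = ("reset_fence", "present_pixmap")


def present_completes(server_order):
    """Replay events in the order the server handles them.

    Returns True if the present completes, False if it waits forever.
    """
    fence_triggered = True   # assumption: left triggered by the previous frame
    present_waiting = False  # a present is queued and waiting on the fence
    presented = False
    for event in server_order:
        if event == "reset_fence":
            fence_triggered = False
        elif event == "present_pixmap":
            present_waiting = True
        elif event == "damage_event":
            # Arrives on a separate fd, not on the presentation socket.
            fence_triggered = True
        else:
            raise ValueError(f"unknown event: {event}")
        if present_waiting and fence_triggered:
            presented = True
            present_waiting = False
    return presented


def server_orders(round_trip_after_reset):
    """Yield every processing order the server may observe."""
    for order in permutations(PRESENT_CONN + ("damage_event",)):
        # Keep the in-order guarantee of the presentation connection.
        if order.index("reset_fence") > order.index("present_pixmap"):
            continue
        # With the round trip, the damage event is only written after the
        # server has already handled reset_fence (modeled assumption).
        if round_trip_after_reset and (
            order.index("damage_event") < order.index("reset_fence")
        ):
            continue
        yield order


def hanging_orders(round_trip_after_reset):
    """List the orders in which the present never completes."""
    return [
        order
        for order in server_orders(round_trip_after_reset)
        if not present_completes(order)
    ]


if __name__ == "__main__":
    print("without round trip:", hanging_orders(False))
    print("with round trip:   ", hanging_orders(True))
```

In this model the only hanging order is the one where the damage event is handled before reset_fence, and the round trip removes exactly that order. The XSync() on the default connection does not appear in the model at all, because it places no constraint on the presentation connection.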

Altogether, this has made my freezing issues go away completely, with absolutely zero performance impact. It also doesn’t matter whether or not I’m using a compositor (picom).

Update: Added code to the repo to reliably reproduce the bug (it just involves bombarding the X server with requests; my DXVK apps don’t last more than a second). Also, while browsing GitHub and the forums, I came across this topic: Complete GPU crash on X11 with "Force Full Composition Pipeline" and VK_KHR_present_wait! 100% reproducible! with an issue tracked under 4174755. I think this might also be related. My guess is that the bug was first introduced around ~535 (I haven’t checked the drivers in the archive yet), and the different “timing characteristics” of the driver versions / system configurations made the issue more / less likely to appear.

tlneondo · April 1, 2025, 6:52pm

That’s strange that you’re seeing those issues in X. Xwayland was working well for me. It’s anything involving Gamescope or the Wine-Wayland driver where I had to disable the VK_KHR_present_wait extension. My issues under Wayland were resolved with the latest vulkan beta developer driver.

vahook · April 1, 2025, 7:14pm

I believe that’s because XWayland has support for explicit sync, in which case libGLX_nvidia.so will use a different rendering path (involving xcb_present_pixmap_synced() instead of xcb_present_pixmap()) and DRI3 sync objects instead of the old XSyncFences.

I have also tested that (570.123.07) under X, but for me, the freezing issues remained. Nvidia didn’t really touch the standard X DRI3 rendering path in that update. And in fact, I could see the exact same race condition happen again.

As for Gamescope or Wine-Wayland, I can’t really comment on that, because I mainly use X, so I didn’t really look into the Wayland comms.

