The possible root cause of the VK_KHR_present_wait (and vulkan related) freezes

Hi!

After spending months reverse engineering various parts of the nvidia linux driver stack, I believe I have finally found the actual root cause of some of the vulkan related freezes (at least on X, using the current production driver 570.133.07). It might also explain some of the freezes under Wayland. If I’m not mistaken, the VK_KHR_present_wait freezes (NV bug 4924590) are still “under investigation”.
Furthermore, I’ve created a simple proof-of-concept fix that solved all of the freezing issues I had with vulkan applications (including DXVK). Although, I have to mention that I’m still rocking a relatively old mobile 1050 Ti, and so I haven’t been able to fully validate my findings on other hardware / software configurations.

The root cause:

(This section is mostly for nvidia engineers, as I won’t define / go into much detail about how the various components and subsystems in the driver work.)
Let’s suppose that we are in an environment where all of the relevant X extensions are available (DRI3, present, sync), and that we are using DRI3 pixmaps for the presentation with “traditional” wait fences. libGLX_nvidia.so orchestrates the pixmap presentation using the X connection that was supplied during the surface creation (e.g. the xcb connection when using vkCreateXcbSurfaceKHR or the Xlib Display object when using vkCreateXlibSurfaceKHR). This means that the xcb_sync_await_fence() and xcb_sync_reset_fence() requests used to reset the wait fence for reuse before executing xcb_present_pixmap() are also sent on that channel. However, libGLX_nvidia.so seemingly also creates a default / fallback X connection (an Xlib Display object) when the library is initialized. It uses it, for example, to communicate with the NV-GLX extension of the X driver (nvidia_drv.so).
And now, for the problematic part: when writing the damage event into the ringbuffers (located in the shared memory region of the “damage manager” component) for a given present wait fence, the client issues an XSync() on this default X connection, instead of the one that was used for the presentation. Since the X driver uses the X server’s input handling mechanisms for the damage manager (i.e. xf86AddGeneralHandler()), this can lead to the following race condition if the server gets overwhelmed with events:

  • The client queues the DRI3 pixmap presentation commands (including xcb_sync_await_fence() and xcb_sync_reset_fence()).
  • It writes (at least the first part of) the damage event, and then issues an XSync() on the default connection. (Since this connection isn’t used for other parts of the DRI3 presentation path, this won’t actually synchronize anything.)
  • The X server consumes just the XSync() on this unused connection and sends a reply. As a result, the client continues.
  • Eventually, the kernel driver does its thing, and it sends the damage event “pulse” to the X driver.
  • Due to the lack of synchronization between the socket used for the presentation and the event manager’s fd, the damage event might be processed before the presentation’s xcb_sync_reset_fence(), leading to the premature triggering of the wait fence before it’s reset.
  • Hence xcb_present_pixmap() will never complete → the PresentIdleNotify event never arrives → the client will block forever in vkQueuePresentKHR()
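
To make the ordering problem in the list above concrete, here is a small toy model in C. It is only a sketch of my reading of the sequence: the names step_t, frame_completes, ORDER_EXPECTED and ORDER_RACY are made up for illustration, the fence is assumed to still be triggered from the previous frame, and none of this is actual driver or X server code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Server-side steps of one frame in a deliberately simplified model. */
typedef enum {
    STEP_AWAIT_FENCE,    /* xcb_sync_await_fence(): passes if the fence is triggered */
    STEP_RESET_FENCE,    /* xcb_sync_reset_fence(): clears the fence for reuse */
    STEP_PRESENT_PIXMAP, /* xcb_present_pixmap(): completes once the fence triggers */
    STEP_DAMAGE_EVENT    /* damage manager handler: triggers the fence */
} step_t;

/* Order the client expects: the damage event is handled last. */
static const step_t ORDER_EXPECTED[] = {
    STEP_AWAIT_FENCE, STEP_RESET_FENCE, STEP_PRESENT_PIXMAP, STEP_DAMAGE_EVENT
};

/* Racy order: the damage event is handled before the presentation socket. */
static const step_t ORDER_RACY[] = {
    STEP_DAMAGE_EVENT, STEP_AWAIT_FENCE, STEP_RESET_FENCE, STEP_PRESENT_PIXMAP
};

/* Returns true if the present completes for the given processing order. */
static bool frame_completes(const step_t *order, size_t n)
{
    bool triggered = true;        /* assumption: left triggered by the previous frame */
    bool present_pending = false; /* present is waiting on the fence */
    bool present_done = false;

    for (size_t i = 0; i < n; i++) {
        switch (order[i]) {
        case STEP_AWAIT_FENCE:
            /* In this model the await always precedes the reset on the same
             * connection, so the fence is still triggered here. */
            assert(triggered);
            break;
        case STEP_RESET_FENCE:
            triggered = false;
            break;
        case STEP_PRESENT_PIXMAP:
            if (triggered)
                present_done = true;
            else
                present_pending = true;
            break;
        case STEP_DAMAGE_EVENT:
            triggered = true;
            if (present_pending) {
                present_pending = false;
                present_done = true;
            }
            break;
        }
    }
    return present_done;
}
```

In this model, frame_completes(ORDER_EXPECTED, 4) returns true, while frame_completes(ORDER_RACY, 4) returns false: the trigger is "spent" on a fence that gets reset afterwards, so the present waits on a fence that nothing will trigger again.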

My fix:

Located at: GitHub - vahook/nvglxfix
I wasn’t really sure whether that XSync() call in the damage event part was a remnant of a legacy presentation path, a typo, or part of some other mechanism I haven’t discovered yet. Therefore, I opted not to touch it. Instead, I inserted a dummy xcb_get_input_focus() request right after xcb_sync_reset_fence(), and awaited its reply just after the xcb_flush() call that happens at the end of the DRI3 pixmap presentation. (XSync() also operates this way, as its xcb equivalent is roughly free(xcb_get_input_focus_reply(c, xcb_get_input_focus(c), NULL)).)
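
Why the round trip has to happen on the presentation connection can be shown with another toy model in C. Again, this is just a sketch under my own assumptions: conn_t, send_request, wait_reply and reset_processed_after_sync are hypothetical names, not xcb or nvglxfix code. The model assumes that the server handles each connection's requests in order, and that in the worst case it only does the work needed to answer the reply the client is waiting for.

```c
#include <stdbool.h>

/* One X connection: requests are numbered and processed strictly in order. */
typedef struct {
    unsigned sent;      /* sequence number of the last request queued */
    unsigned processed; /* sequence number of the last request the server handled */
} conn_t;

/* Queue a request and return its sequence number. */
static unsigned send_request(conn_t *c)
{
    return ++c->sent;
}

/* Block until the reply for `seq` arrives. Worst case assumed here: the
 * server processes this connection only up to `seq` and nothing else. */
static void wait_reply(conn_t *c, unsigned seq)
{
    if (c->processed < seq)
        c->processed = seq;
}

/* Returns true if the fence reset is guaranteed to have been processed
 * by the time the client continues past its round trip. */
static bool reset_processed_after_sync(bool sync_on_present_conn)
{
    conn_t present = {0, 0};  /* connection used for the DRI3 presentation */
    conn_t fallback = {0, 0}; /* default connection created by the library */

    send_request(&present);                      /* await fence */
    unsigned reset_seq = send_request(&present); /* reset fence */

    /* Dummy round-trip request, queued right after the reset. */
    conn_t *sync_conn = sync_on_present_conn ? &present : &fallback;
    unsigned cookie = send_request(sync_conn);

    send_request(&present);        /* present pixmap */
    wait_reply(sync_conn, cookie); /* awaited after the flush */

    return present.processed >= reset_seq;
}
```

Here reset_processed_after_sync(true) returns true and reset_processed_after_sync(false) returns false. In other words, a round trip on the fallback connection says nothing about how far the server has gotten on the presentation connection.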

Altogether, this has made my freezing issues go away completely, with absolutely zero performance impact. It also doesn’t matter whether or not I’m using a compositor (picom).

Update: Added code to the repo to reliably reproduce the bug (it just involves bombarding the X server with requests; my DXVK apps don’t last more than a second). Also, while browsing github and the forums, I came across this topic: Complete GPU crash on X11 with "Force Full Composition Pipeline" and VK_KHR_present_wait! 100% reproducible!, whose issue is tracked under 4174755. I think this might also be related. My guess is that the bug was first introduced around 535 (I haven’t checked the drivers in the archive yet), and the different “timing characteristics” of the driver versions / system configurations made the issue more / less likely to appear.


That’s strange that you’re seeing those issues in X. Xwayland was working well for me. It’s anything involving Gamescope or the Wine-Wayland driver where I had to disable the VK_KHR_present_wait extension. My issues under Wayland were resolved with the latest vulkan beta developer driver.

I believe that’s because XWayland has support for explicit sync, in which case libGLX_nvidia.so will use a different rendering path (involving xcb_present_pixmap_synced() instead of xcb_present_pixmap()) and DRI3 sync objects instead of the old XSyncFences.

I have also tested that (570.123.07) under X, but for me, the freezing issues remained. Nvidia didn’t really touch the standard X DRI3 rendering path in that update. And in fact, I could see the exact same race condition happen again.

As for Gamescope or Wine-Wayland, I can’t really comment on that, because I mainly use X. And thus I didn’t really look into the Wayland comms.
