Hi!
After spending months reverse engineering various parts of the nvidia linux driver stack, I believe I have finally found the actual root cause of some of the Vulkan-related freezes (at least on X, using the current production driver 570.133.07). I suspect it might also explain some of the freezes under Wayland. If I’m not mistaken, the VK_KHR_present_wait freezes (NV bug 4924590) are still “under investigation”.
Furthermore, I’ve created a simple proof-of-concept fix that solved all of the freezing issues I had with Vulkan applications (including DXVK). I have to mention, though, that I’m still rocking a relatively old mobile 1050 Ti, so I haven’t been able to fully validate my findings on other hardware / software configurations.
The root cause:
(This section is mostly for nvidia engineers, as I won’t define / go into much detail about how the various components and subsystems in the driver work.)
Let’s suppose that we are in an environment where all of the relevant X extensions are available (DRI3, present, sync), and that we are using DRI3 pixmaps for the presentation with “traditional” wait fences. libGLX_nvidia.so orchestrates the pixmap presentation using the X connection that was supplied during the surface creation (e.g. the xcb connection when using vkCreateXcbSurfaceKHR or the Xlib Display object when using vkCreateXlibSurfaceKHR). This means that the xcb_sync_await_fence() and xcb_sync_reset_fence() requests used to reset the wait fence for reuse before executing xcb_present_pixmap() are also sent on that channel. However, libGLX_nvidia.so seemingly also creates a default / fallback X connection (an Xlib Display object) when the library is initialized. It uses it, for example, to communicate with the NV-GLX extension of the X driver (nvidia_drv.so).
And now, for the problematic part: when writing the damage event into the ringbuffers (located in the shared memory region of the “damage manager” component) for a given present wait fence, the client issues an XSync() on this default X connection, instead of the one that was used for the presentation. Since the X driver uses the X server’s input handling mechanisms for the damage manager (i.e. xf86AddGeneralHandler()), this can lead to the following race condition if the server gets overwhelmed with events:
- The client queues the DRI3 pixmap presentation commands (including xcb_sync_await_fence() and xcb_sync_reset_fence()).
- It writes (at least the first part of) the damage event, and then issues an XSync() on the default connection. (Since this connection isn’t used for other parts of the DRI3 presentation path, this won’t actually synchronize anything.)
- The X server consumes just the XSync() on this unused connection and sends a reply. As a result, the client continues.
- Eventually, the kernel driver does its thing, and it sends the damage event “pulse” to the X driver.
- Due to the lack of synchronization between the socket used for the presentation and the event manager’s fd, the damage event might be processed before the presentation’s xcb_sync_reset_fence(), leading to the premature triggering of the wait fence before it’s reset.
- Hence xcb_present_pixmap() will never complete → the PresentIdleNotify event never arrives → the client will block forever in vkQueuePresentKHR().
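To make the ordering problem easier to see, here is a tiny toy model of the steps above in plain C. It does not touch X or the driver at all: the enum values, the present_completes() helper and the two example orderings are names I made up for illustration, and the fence is reduced to a single boolean that I assume was left triggered by the previous frame. It only shows that the outcome depends on whether the server handles the damage event before or after the fence reset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified server-side operations (not real driver or X server names). */
enum op {
    OP_RESET_FENCE,    /* models xcb_sync_reset_fence() arriving on the presentation socket */
    OP_PRESENT_PIXMAP, /* models xcb_present_pixmap(), which waits on the fence */
    OP_DAMAGE_EVENT    /* models the damage event, which triggers the fence */
};

/* Order the client intends: reset, queue the present, then the fence gets triggered. */
static const enum op intended_order[] = { OP_RESET_FENCE, OP_PRESENT_PIXMAP, OP_DAMAGE_EVENT };

/* Order under the race: the damage event is handled before the reset. */
static const enum op racy_order[] = { OP_DAMAGE_EVENT, OP_RESET_FENCE, OP_PRESENT_PIXMAP };

/* Replays the operations in the order the server happens to handle them.
 * Returns true if the queued present can complete, false if it stays stuck. */
bool present_completes(const enum op *order, size_t n)
{
    bool fence_triggered = true;  /* assumption: left triggered by the previous frame */
    bool present_pending = false;

    for (size_t i = 0; i < n; i++) {
        switch (order[i]) {
        case OP_RESET_FENCE:    fence_triggered = false; break;
        case OP_DAMAGE_EVENT:   fence_triggered = true;  break;
        case OP_PRESENT_PIXMAP: present_pending = true;  break;
        }
        /* A pending present goes through as soon as the fence is triggered. */
        if (present_pending && fence_triggered)
            present_pending = false;
    }
    return !present_pending;
}
```

With intended_order the present goes through, while with racy_order the trigger is wiped out by the later reset and the present is still pending at the end, which corresponds to the client hanging in vkQueuePresentKHR().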
My fix:
Located at: GitHub - vahook/nvglxfix
I wasn’t really sure whether that XSync() call in the damage event part was a remnant of a legacy presentation path, a typo, or part of some other mechanism I haven’t discovered yet. Therefore, I opted not to touch it. Instead, I inserted a dummy xcb_get_input_focus() request right after xcb_sync_reset_fence(), and awaited its reply just after the xcb_flush() call that happens at the end of the DRI3 pixmap presentation. (XSync() also operates this way, as its xcb equivalent is roughly free(xcb_get_input_focus_reply(xcb_get_input_focus(c))).)
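The reason a dummy round trip works as a barrier is that the X server handles the requests of a single connection in order, so a reply to a later request implies that every earlier request on that connection has already been handled. Below is a rough toy model of that idea in plain C. It is not the code from the repo and it does not use xcb: struct toy_conn and all toy_* functions are hypothetical stand-ins, and the "no round trip" case simply assumes the worst case where the server hasn't read the presentation socket yet.

```c
#include <stdbool.h>

/* Hypothetical toy model of one X connection (not xcb's real data structures). */
struct toy_conn {
    unsigned sent;      /* sequence number of the last request queued by the client */
    unsigned processed; /* sequence number of the last request handled by the server */
};

/* Queue a request and return its sequence number. */
unsigned toy_send(struct toy_conn *c)
{
    return ++c->sent;
}

/* The server handles the requests of one connection strictly in order. */
bool toy_server_step(struct toy_conn *c)
{
    if (c->processed == c->sent)
        return false; /* nothing left to handle */
    c->processed++;
    return true;
}

/* Block until the reply for `seq` exists, i.e. the server has reached that request. */
void toy_wait_reply(struct toy_conn *c, unsigned seq)
{
    while (c->processed < seq && toy_server_step(c))
        ;
}

/* Returns true if the fence reset is guaranteed to be handled
 * by the time the client would write the damage event. */
bool reset_done_before_damage(bool use_roundtrip)
{
    struct toy_conn present = { 0, 0 };

    unsigned reset_seq = toy_send(&present);   /* models xcb_sync_reset_fence() */
    unsigned focus_seq = 0;
    if (use_roundtrip)
        focus_seq = toy_send(&present);        /* models the dummy xcb_get_input_focus() */
    toy_send(&present);                        /* models xcb_present_pixmap() */

    if (use_roundtrip)
        toy_wait_reply(&present, focus_seq);   /* models awaiting the reply after xcb_flush() */

    /* This is the point where the client would go on to write the damage event. */
    return present.processed >= reset_seq;
}
```

In this model reset_done_before_damage(true) holds because waiting for the dummy reply drags the server past the reset, whereas reset_done_before_damage(false) does not, since nothing forces the server to look at the presentation connection before the damage event shows up.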
Altogether, this has made my freezing issues go away completely, with absolutely zero performance impact. It also doesn’t matter whether or not I’m using a compositor (picom).
Update: Added code to the repo to reliably reproduce the bug (it just involves bombarding the X server with requests; my DXVK apps don’t last more than a second). Also, while browsing GitHub and the forums, I came across this topic: Complete GPU crash on X11 with "Force Full Composition Pipeline" and VK_KHR_present_wait! 100% reproducible!, with an issue tracked under 4174755. I think this might also be related. My guess is that the bug was first introduced around 535 (I haven’t checked the drivers in the archive yet), and that the different “timing characteristics” of the driver versions / system configurations made the issue more / less likely to appear.