Presentation in latest Nvidia driver [545.29.02-4] appears to be bugged

Hello. I’m running KDE Plasma 5 via Wayland on Arch Linux x64, using a GTX 1070 Ti.

Recently, I updated to the latest NVIDIA driver (545.29.02-4).

Ever since, a Vulkan-based application I’ve been developing has been completely unusable, despite zero changes to my code. It is constantly frozen and “not responding”, and if I click on a button drawn via imgui, it takes a minute or longer to react, if it reacts at all. This occurs with both a clean build of my app and an older build compiled before the driver update.

It appears to be caused by an Nvidia driver bug.

In my attempts to debug this issue, I’ve paused my application while debugging it many times, and have noticed that it always (100% of the time) pauses on the same function: vkQueuePresentKHR. This indicates that the vast majority of my application’s runtime is being spent on this function call, which is unusual.

If I place a breakpoint on my application’s call to this function, and another on the code that immediately follows it, I can see that vkQueuePresentKHR takes 7 seconds or more to return. That is abnormally slow, and it very much seems to be the cause of my extreme performance issues.
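For anyone who wants to measure this without a debugger, here is a minimal sketch of timing a single call with a monotonic clock. It is written in Python purely for illustration; `timed_call` and `present_frame` are hypothetical names, and `present_frame` is only a stand-in for whatever code ends up calling vkQueuePresentKHR, not the actual application code.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def present_frame():
    # Hypothetical stand-in for the code that calls vkQueuePresentKHR.
    # The short sleep only simulates some work being done.
    time.sleep(0.01)
    return "VK_SUCCESS"


result, elapsed = timed_call(present_frame)
print(f"present returned {result} after {elapsed * 1000:.1f} ms")
```

Logging the elapsed time on every frame would show whether every present call stalls or only some of them.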

Running perf on my application shows that it is spending the vast majority of its time in a single kernel function call: _nv040303rm. Googling this gives no results, but the “nv” obviously leads me to believe this is a function from the nvidia driver.

I’d like to note that this happens with both debug and release builds of my application, with or without the Vulkan validation layers enabled. Oddly, it does not seem to occur in other Vulkan-based applications, such as games run via Proton and DXVK, so it must be something my application in particular is doing that triggers this bug. However, it still happens even if I comment out all of my update/rendering code so that my application renders only a clear color and nothing else.

In addition to all of this, ever since the driver update, I’ve also been experiencing various “flickering” issues while using XWayland applications, such as Spotify, Discord, or Visual Studio Code.

It appears as though the contents of the applications’ windows are being presented to the screen, despite the rendering not yet actually being complete, resulting in various visual anomalies, such as some of the contents of the window not being shown on-screen.

I initially was experiencing these issues constantly, to the point where it was very frustrating. Now, for whatever reason, I’m suddenly unable to reproduce them. I’ll update this post with a link to a screen recording showing the issue if I manage to reproduce it again.

UPDATE: I managed to reproduce it: https://www.youtube.com/watch?v=wHVYkRuwYXc. Pay attention to the Spotify window around the 5 and 7 second marks.

These issues do not seem to occur at all in native Wayland applications. When this was happening often at first, forcing the Wayland backend in electron-based apps, such as Visual Studio Code, completely resolved the flickering issues. Unfortunately, not all apps have a native Wayland version available.
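As a rough illustration of what forcing the Wayland backend can look like, the sketch below builds a launch command with the Chromium Ozone flags that Electron apps commonly accept. The exact flags used are not stated above, so treat `--enable-features=UseOzonePlatform` and `--ozone-platform=wayland` as an assumption about a typical setup. `wayland_argv` is a hypothetical helper, and `code` is assumed to be the Visual Studio Code launcher on the PATH.

```python
import shlex

# Assumed flags: the usual Chromium/Electron switches for the Wayland backend.
OZONE_WAYLAND_FLAGS = [
    "--enable-features=UseOzonePlatform",
    "--ozone-platform=wayland",
]


def wayland_argv(binary, *extra_args):
    """Return an argv list that launches an Electron app on native Wayland."""
    return [binary, *OZONE_WAYLAND_FLAGS, *extra_args]


argv = wayland_argv("code")
# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv)` would actually start the app. Whether a given Electron version honors these flags is something to verify per application.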

Putting all of this together, it appears to me that the cause of both of these problems I am experiencing is related to a bug regarding presentation which was introduced in the latest Nvidia driver.

I am not sure whether a forum post is the proper way to report this as a possible bug. I will be happy to provide logs or any other assistance in debugging this issue if needed.

Update: My application uses GLFW. Today, I’ve discovered that replacing the glfw-wayland package with glfw-x11, and relaunching my application (I don’t even have to recompile), fixes the issue. DXVK/Proton games are also XWayland if I’m not mistaken, which explains why they aren’t affected by this extremely slow (7 seconds) vkQueuePresentKHR call issue.

I then built the vkcube sample (which uses Vulkan but does not use GLFW) as both a native Wayland application and an XWayland application. The native Wayland application does, in fact, exhibit the same issue as my application, and the XWayland application does not.

Finally, I built some Wayland OpenGL samples from GitHub and ran those. They were not affected, and ran smoothly.

All of this makes me think that it’s a presentation bug specifically related to the VK_KHR_wayland_surface extension, which would explain why both XWayland applications and native Wayland applications that do not use Vulkan are unaffected.

Prior to all of this, I came across this post regarding slow vkQueuePresentKHR on Nvidia and tried using each possible combination of supported surface formats and supported presentation modes. None of these resolved the issue (this user was on Windows anyway, but I thought it was worth trying).
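A sketch of how such an exhaustive sweep can be organized is shown below. The format and present-mode names are real Vulkan enum names, but the particular lists are only illustrative and are an assumption; in a real run they would come from vkGetPhysicalDeviceSurfaceFormatsKHR and vkGetPhysicalDeviceSurfacePresentModesKHR. `swapchain_configs` is a hypothetical helper.

```python
from itertools import product

# Illustrative values only; the real lists are whatever the driver reports.
surface_formats = ["VK_FORMAT_B8G8R8A8_UNORM", "VK_FORMAT_B8G8R8A8_SRGB"]
present_modes = [
    "VK_PRESENT_MODE_FIFO_KHR",
    "VK_PRESENT_MODE_MAILBOX_KHR",
    "VK_PRESENT_MODE_IMMEDIATE_KHR",
]


def swapchain_configs(formats, modes):
    """Return every (format, present_mode) pair to try, in a stable order."""
    return list(product(formats, modes))


for fmt, mode in swapchain_configs(surface_formats, present_modes):
    # A real test would recreate the swap chain with this pair and time a present.
    print(f"try format={fmt} present_mode={mode}")
```

Iterating over the full product makes it easy to confirm that no single combination avoids the stall.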

I also tried using the disable_vrr_mclk_switch=1 option, as I saw some posts indicating there are some issues regarding VRR support in the new driver update. I don’t have a high-refresh-rate screen anyway; I’m using two screens with this machine: a 1080p/60Hz Samsung TV and an old 4:3 Dell VGA monitor (ah, the joys of being broke). Neither of them supports refresh rates higher than 60Hz. But I thought it was still worth trying this option.

Additionally, regarding the XWayland flickering issue, after researching it some more, I believe this problem is not technically an Nvidia driver issue. My understanding now is that Nvidia created patches for XWayland over a year ago to add support for explicit synchronization, but these patches have not yet been merged. As a result, XWayland windows that dip below the screen refresh rate encounter this issue where frames are sometimes presented out of order (meaning future frames that are still in the process of being rendered are accidentally displayed instead of the already-finished past frames). It’s up to the XWayland maintainers to merge these changes.

But vkQueuePresentKHR being extremely slow appears to be unrelated to this. This issue does still appear to be an Nvidia driver bug.

Finally, I’ve run nvidia-bug-report.sh and have attached the nvidia-bug-report.log.gz file to this post.
nvidia-bug-report.log.gz (1.1 MB)

It seems you still have the Intel iGPU active. Can you disable it in the BIOS to check whether that has any influence on your issue?

Sure, good idea! I’ve disabled the iGPU in BIOS and tried it again using only my primary monitor, which is connected to the GTX 1070 Ti via HDMI. I’m still experiencing the same lag issue with native Wayland applications that use Vulkan. As before, native Wayland applications which use OpenGL do not experience this lag issue. And, the XWayland app flickering issue also still occurs.

I’d like to report that this issue still occurs exactly as described in 545.29.06-1.

I am also experiencing the same slow vkQueuePresentKHR issue on Manjaro (linux66-nvidia 545.29.06-8, mutter 45.1). From debugging, it seems the time is spent inside poll() on a file descriptor named ‘anon_inode:sync_file’.
Both my application and vkcube-wayland freeze for 5 seconds on the first presentation, until the window is shown with the first frame. They then freeze for another 5 seconds, after which they run fine. Additionally, my application crashes (vkQueuePresentKHR returns OutOfDate and the Wayland connection is closed) if it receives mouse events during that freeze period.
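To get a feel for how a wait like this behaves, a small sketch that times poll() on a file descriptor may help. It uses an ordinary pipe as a stand-in, since a real sync_file descriptor comes from the driver and cannot be created here. `timed_poll` is a hypothetical helper, and the code is Linux-only.

```python
import os
import select
import time


def timed_poll(fd, timeout_ms):
    """Poll fd for readability; return (ready, elapsed_seconds).

    ready is True if poll() reported any event on fd (including hangup
    or error), and False if the timeout expired first.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    start = time.monotonic()
    events = poller.poll(timeout_ms)
    elapsed = time.monotonic() - start
    return bool(events), elapsed


read_fd, write_fd = os.pipe()  # stand-in for a fence fd, not a real sync_file
try:
    # Nothing has been written yet, so this call waits for the full timeout.
    ready, elapsed = timed_poll(read_fd, 100)
    print(f"before write: ready={ready} after {elapsed * 1000:.0f} ms")

    # Once data is available, poll() returns almost immediately.
    os.write(write_fd, b"x")
    ready, elapsed = timed_poll(read_fd, 100)
    print(f"after write: ready={ready} after {elapsed * 1000:.0f} ms")
finally:
    os.close(read_fd)
    os.close(write_fd)
```

A fence that is never signalled, or signalled late, would look like the first case: the caller sits in poll() until the timeout expires.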

The XWayland flickering has also been happening here on 545, even with xorgproto, wayland-protocols and xorg-xwayland patched for explicit sync. It doesn’t happen on 535.
The attached video shows the issue. It was recorded with Gamescope in an X11 session, and the issue happens only with XWayland apps. Hypnotix, which uses XWayland on Wayland, also triggers the same flicker.
Video_2023-12-23_10-39-46.zip (40.6 MB)

I meant to post this several months ago, but apparently it never sent:

I saw that 550.40.07 was released on the beta branch, and this is in its changelog:

Fixed an issue that sometimes caused Wayland applications to run at less than one frame per second on Maxwell, Volta, and Pascal series GPUs.

Sure enough, I’m thrilled to report that this beta release appears to completely fix the Vulkan swap chain issue I described in this thread!

Huge thank you to the Nvidia Linux driver team!