Vulkan apps crash or freeze on VT switch

When I switch VT (virtual terminal), Vulkan applications either crash or freeze. More annoyingly, this also happens when locking my screen via light-locker (which I believe switches VT). I am on Debian Testing, driver 525.105.17, NVIDIA RTX 3070.

The simplest app to test is “vkcube”, which comes from the package “vulkan-tools”. After a VT switch, it crashes with an assertion failure (it seems that some Vulkan call returns VK_ERROR_DEVICE_LOST).

Another app is Chromium with hardware acceleration enabled. Switching VT freezes it for like 15 seconds and then seemingly recovers (according to the console output, its GPU process crashes).

Next I tried some Vulkan games on Steam running via Proton, such as Doom Eternal. They all freeze indefinitely.

OpenGL apps do not suffer from this problem. Because of that and the fact I’ve not been able to find a single Vulkan app that survives a VT switch, I believe this may be an NVIDIA driver bug.

nvidia-bug-report.log.gz (304.2 KB)

EDIT: Some additional information that may or may not be helpful:

  • I only have one GPU (no integrated one).
  • The console seems to be driven by efifb. Perhaps there is some bad interaction between efifb and nvidia-driver.

Unlike OpenGL, which recovers from device mode switches behind the scenes in the driver, Vulkan handles them by returning VK_ERROR_DEVICE_LOST. Support for that is hit and miss in Vulkan apps in general, and I think vkcube is no exception. The version of vkcube that I have just hangs, so crashing on an assertion sounds like an improvement. What it’s supposed to do is recreate its device and continue, but it sounds like that’s not implemented.

You can avoid generating VK_ERROR_DEVICE_LOST on VT switches by setting the NVreg_PreserveVideoMemoryAllocations=1 parameter on the nvidia kernel module. Please note that if that option is enabled, then suspending the system needs to go through the nvidia-suspend.service and nvidia-resume.service systemd units in order for the suspend sequence to work properly. From your bug report log, it looks like those services aren’t installed:

____________________________________________

/usr/bin/systemctl status nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service nvidia-powerd.service

____________________________________________
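
For anyone finding this later, here is a minimal sketch of how the parameter is usually set. It assumes module options are read from /etc/modprobe.d/ and that the module is loaded under the name nvidia; some distribution packages rename the module, in which case that name goes after "options". The file name is arbitrary.

```
# /etc/modprobe.d/nvidia-preserve-vram.conf  (hypothetical file name)
# Assumes the kernel module is loaded under the name "nvidia".
options nvidia NVreg_PreserveVideoMemoryAllocations=1
```

The option only takes effect the next time the module is loaded, so a reboot is the simplest way to apply it. If the module is included in the initramfs, that image may also need to be regenerated (on Debian, typically with update-initramfs -u).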

Thank you very much for your response! It is unfortunate that not even Chromium handles errors properly.

Anyway, I can confirm that the NVreg_PreserveVideoMemoryAllocations=1 parameter allows all of the applications to continue gracefully.
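
In case it helps others check that the parameter is actually active, the running driver exposes its parameters under /proc. A small sketch, assuming the nvidia module is loaded:

```shell
# Read back the parameter from the running nvidia kernel module.
# The /proc file only exists while the nvidia module is loaded.
params=/proc/driver/nvidia/params
if [ -r "$params" ]; then
    # A value of 1 on the matching line means preservation is enabled.
    grep PreserveVideoMemoryAllocations "$params" || echo "parameter not listed in $params"
else
    echo "nvidia module not loaded (no $params)"
fi
```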

Interesting, I can’t find these services anywhere… It could be that Debian hasn’t packaged them. But oh well, I think I can live without suspend/resume for now.
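
A quick way to see which of these units are present is to list the unit files by name pattern. This is only a sketch; an empty listing suggests the distribution package does not ship them, but it does not say why.

```shell
# Collect installed unit files whose names start with "nvidia-".
# Falls back to an empty result if systemctl is missing or fails.
units=""
if command -v systemctl >/dev/null 2>&1; then
    units=$(systemctl list-unit-files --no-pager 'nvidia-*' 2>/dev/null || true)
fi
printf '%s\n' "${units:-systemctl unavailable or returned nothing}"
```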

By the way, I came across your comment on bug 216303 (“Commit ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec broke legacy frame buffer with NVIDIA”), where you mention:

I’m looking at making the NVIDIA driver install its own framebuffer console in order to work around this problem, but that will take a little while to develop and get it into production.

Would this also solve the problem (along with others such as slow VT switching and low resolution)? Or would VK_ERROR_DEVICE_LOST still be generated?

No, that’s unrelated. VT switching loses the device because the X driver has to assume that, when it’s not on the active VT, the system could suspend at any time. If video memory contents are lost during suspend, then the driver needs to recover anything that was in video memory after resume. For Vulkan, that necessitates a VK_ERROR_DEVICE_LOST. If video memory preservation is enabled, then the X driver knows that memory contents will stay where they are and can allow applications to just continue on as if nothing happened.

Whether the framebuffer console is driven by DRM or some other framebuffer console driver doesn’t make a difference here.