Inconsistent but frequent freeze caused by SteamVR

Under certain not fully known but common conditions, when SteamVR is running, the display in the HMD may or may not be corrupted, but in any case it will freeze, accompanied by the X server becoming completely unresponsive. In top, Xorg and some other applications (typically vrcompositor, the currently running game if any, and sometimes Discord for whatever reason) will be shown as spinning the CPU at 100%. I’ve found that I can usually recover the X session by repeatedly killing SteamVR processes (killing vrcompositor by itself doesn’t appear to be enough).
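For anyone who wants to script the "keep killing SteamVR processes" recovery from an SSH session, here is a minimal sketch. The process names in `STEAMVR_NAMES` are my assumption about what SteamVR spawns, and the function names are made up for illustration; check `top` or `ps` for what is actually running on your system.

```python
import os
import signal
import time

# Assumed SteamVR process names; adjust to match what `top` shows.
STEAMVR_NAMES = {"vrcompositor", "vrserver", "vrmonitor", "vrdashboard"}


def find_pids(names, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/comm matches one of `names`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited or is not readable
        if comm in names:
            pids.append(int(entry))
    return sorted(pids)


def kill_until_gone(names, rounds=5, delay=1.0):
    """Send SIGKILL to matching processes, repeating until none remain.

    Returns True if no matching process is left after at most `rounds` passes.
    """
    for _ in range(rounds):
        pids = find_pids(names)
        if not pids:
            return True
        for pid in pids:
            try:
                os.kill(pid, signal.SIGKILL)
            except (ProcessLookupError, PermissionError):
                pass  # already gone, or owned by another user
        time.sleep(delay)
    return not find_pids(names)
```

Calling `kill_until_gone(STEAMVR_NAMES)` repeats the kill pass because the processes tend to respawn or linger. A process stuck inside the driver may not react to SIGKILL at all, in which case this will simply return False.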

It’s extremely unlikely this is a hardware fault, as the same system can run GTA V in Proton without visible artifacts, performance issues, or crashes/hangs.

System specs and other useful info should be included in the nvidia-bug-report.log (once I figure out how to attach it). In particular, dmesg output leads me to believe this is a bug in the graphics driver itself.
nvidia-bug-report.log.gz (1.02 MB)

Has this only started with the v430 driver or has it also happened with v418?

I had upgraded from 418 to 430 when I ran into issues. Both have been unusably unstable.

Should I try going further backwards? 415? 410? 396?

Worth a try would be 410 or 390.

The system would not fully boot under 410. The local console just showed a frozen Ubuntu splash, completely unresponsive. Thankfully I was able to SSH in and inspect the system remotely. It looks like X wouldn’t even start, or gdm3 got hung up on something before it reached the point of spawning X. nvidia-bug-report hung during vulkaninfo, which did not respond to signals, not even SIGKILL. dmesg was… ugly. I was able to reinstall 430 via the remote shell. The system would not shut down for a reboot, but it was back to normal after a forced reset. I’m attaching the partial NBR log regardless.

390 booted fine, but exhibits the same problem. Initially I got a hang just starting SteamVR; physically disconnecting my second monitor seemed to fix this. But then I went on and reproduced one of the crashes that happened on 430, exactly, right down to the circumstances leading up to the crash and the precise corruption pattern displayed in the HMD. Adding an NBR log from that as well, why not.

I’ve found the most consistent way to produce the crash is to launch Beat Saber (in Proton), start a song, and open the SteamVR system menu after about 30 seconds. It’s not 100%, but trying to close the system menu will usually immediately produce a nasty orange and green pattern in the left eye, with all displays going completely unresponsive.
390.nvidia-bug-report.log.gz (521 KB)
410.nvidia-bug-report.log.gz (82.6 KB)

Just for the sake of argument, I tried the developer Vulkan beta 418.52.05. That one actually crashed when starting SteamVR, and dmesg looked a bit different, as if it was a different bug. Perfectly reproducible though. I collected some notes on it, but I’m inclined to write it off as “beta being beta”.

I reverted to 430.09 just for the sake of running code that’s considered “stable”. As an experiment I left Beat Saber at the health and safety warning screen for about 5 minutes, and couldn’t produce a crash by repeatedly opening and closing the SteamVR dashboard. Then I loaded a song and was able to reproduce the crash the same way as previously. So it’s not a matter of time; rather, something the game does when loading into or during gameplay sets up the driver’s internal state for failure.

In this thread on SteamVR forums:

He talks about a different issue that may or may not be driver related, but more importantly, for him to notice that issue it’s very likely he hasn’t been seeing this freeze. He has a Turing GPU on 418.74. The 418 driver that apt wants to install is .56, which is what I was testing previously. Based on the changelogs I doubt .74 has a fix, even accidental. Is this freeze possibly Pascal-specific?

Did you already check whether your HMD, especially the cabling, is faulty, either by using it on a different system or by installing Windows on yours?

This is already a dualboot system with Win10 on a separate SSD. There are no issues whatsoever with SteamVR or Beat Saber on the Windows side.

Things to try:

Removing the 4K monitor from the system changed the symptoms slightly, but still resulted in a freeze.

After trying to leave the dashboard, the game froze but the VR compositor was still responding (“waiting for” fallback scene) and I was able to reopen the dashboard. Trying to switch to desktop view in the dashboard caused the compositor to freeze. At this point I took off the HMD and found the desktop display frozen as well.

I was ultimately unable to save the X session, but killing Xorg recovered the system, dropping back to gdm3.

I’ll try updated Xorg shortly.

X 1.20 broke things pretty spectacularly.

Just trying to start SteamVR hangs the system. It still responds over network, at least initially.

First time I had both displays connected. While trying to recover the session, I saw my keyboard and mouse lights go out. dmesg says the XHCI controller stopped responding. About 30 seconds later the system stopped responding over the network as well.

Second time I disconnected the 4K display again. Still froze when trying to start SteamVR, but it didn’t seem to progress from there. I still couldn’t recover the local displays, even after killing both the session and GDM X servers. Tried to reboot system from console and it just hung.

I have a Radeon RX 480 lying around in a spare machine collecting dust. I’m very tempted to drop it into the system for testing, if nothing else to verify that it’s the Nvidia hardware and driver stack that causes or at least catalyzes the instability.

It took some changes to the setup to get SteamVR to initialize the HMD over the Radeon. I disconnected the side (not 4K) display for convenience, but the main fix was switching the HMD connection from DisplayPort to HDMI. At that point the game was perfectly stable, although not the most performant.

To be necessarily thorough, I switched back to the GeForce and replicated the changes. So, HMD connected over HDMI, 4K display only. Still got a freeze under the same circumstances, but there was no corruption in the HMD (image still frozen however), the desktop remained sort-of responsive, I was able to “Force Quit” the game from the desktop, and SteamVR, while not rendering anymore, appeared to shut down cleanly when asked.

This leads to two conclusions:

  1. Connecting the HMD over DisplayPort triggers unrelated issues;
  2. There is an Nvidia-specific problem that occurs even when connected over HDMI.

Now perhaps let’s play with different driver versions.

I’ve now tried 430.09, 418.56, Vulkan beta 418.52.05, and 410.104. The freeze is perfectly consistent. It is now less severe, and you can consistently recover the desktop session, but the application and VR compositor do not survive.
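When swapping between this many driver versions, it helps to confirm which kernel module is actually loaded before each test run. A minimal sketch, assuming the driver exposes `/proc/driver/nvidia/version` with an "NVRM version:" line (the function names are hypothetical helpers, not part of any NVIDIA tool):

```python
import re


def parse_nvidia_version(text):
    """Extract the driver version from the 'NVRM version:' line, or return None."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # First dotted number on the line, e.g. 430.09 or 418.52.05.
            match = re.search(r"\b(\d+\.\d+(?:\.\d+)?)\b", line)
            return match.group(1) if match else None
    return None


def loaded_nvidia_version(path="/proc/driver/nvidia/version"):
    """Return the loaded kernel module version, or None if the file is unreadable."""
    try:
        with open(path) as f:
            return parse_nvidia_version(f.read())
    except OSError:
        return None  # module not loaded, or not an Nvidia system
```

Printing `loaded_nvidia_version()` right before launching SteamVR makes it easy to note the exact version next to each test result.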

Reproduction steps:

  1. Launch SteamVR
  2. Launch Beat Saber in Proton (default settings when starting through VR dashboard or Steam desktop window)
  3. Click through health & safety warning and wait for main menu to load
  4. Open SteamVR dashboard
  5. Close SteamVR dashboard

This will cause an app freeze and responsiveness issues on the desktop, every time without fail. In less than a minute, the video in the HMD will freeze and the compositor will stop rendering, but you can still exit SteamVR cleanly from the desktop status window.

Again, an RX 480 with amdgpu/radv in an otherwise perfectly identical configuration, same PC, same OS install etc. does not have this issue.

If anyone is unable to reproduce please speak up. Hopefully we can either confirm a driver bug or narrow down what specific quirk of my system is making things go horribly wrong.

Looking at the compatibility reports, this must be something specific to your system as other people with nearly the same setup can run it without issue. Don’t know what’s happening, though.

You could also test what happens if you enable the HMD for desktop use by adding the following to the Device section of your xorg.conf:

Option "AllowHMD" "true"

I’ve observed numerous freezes with similar symptoms in a variety of apps; Beat Saber is just the one I’ve nailed down to a very consistent test. I’m certainly open to experimenting to find similar triggers in other apps.

It bugs me that the crash cannot happen until after the health & safety screen. It’s not enough to throw control back to the app; there’s some kind of setup that’s breaking things. Something to do with shared buffers somehow?

I have some ideas to try, will experiment tomorrow.

I’ve had this issue on my RTX 2080, but not with the same drivers you are having trouble with. For me I first encountered it on 418.52.03 and was also able to duplicate it on 418.52.05, but 430.09 and 430.14 are stable for me (over 3 hours in Beat Saber with .09 and over an hour with .14).

Very similar symptoms though: the HMD display will freeze and get a random corrupted image on it, X11 will update my mouse for a few seconds but windows are unresponsive, and then a few seconds later my entire display will freeze up. I’m unable to even switch to full-screen TTYs, and my system journal will show nvidia-drm kmod errors.
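For comparing logs between systems, a small sketch that pulls just the Nvidia-related lines out of the kernel log may help. The marker strings in `NVIDIA_LOG_PATTERN` are my assumption about what the driver's messages contain, and the helper names are made up; `dmesg` may also need root depending on your distro's settings.

```python
import re
import subprocess

# Assumed markers for Nvidia kernel messages; extend as needed.
NVIDIA_LOG_PATTERN = re.compile(r"NVRM|nvidia-drm|nvidia-modeset", re.IGNORECASE)


def nvidia_lines(log_text):
    """Return the lines of `log_text` that mention the Nvidia kernel modules."""
    return [line for line in log_text.splitlines() if NVIDIA_LOG_PATTERN.search(line)]


def read_kernel_log():
    """Return the current kernel log as text by running `dmesg`."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Something like `print("\n".join(nvidia_lines(read_kernel_log())))` over SSH right after a freeze gives a short excerpt to paste here alongside the full bug report log.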

Hi Maladaptly,

Please try driver 430.14 and share the results with us.
If it doesn’t work for you, I will try to reproduce issue internally.

430.14 shows a marked improvement, but still not truly fixed.

The hang isn’t 100% consistent anymore (but still happens most of the time) and is localized to just the app and the compositor. Additionally, the compositor appears to crash and bail out instead of just freezing, and the VR monitor window survives (showing a “-203” error). dmesg looks markedly different but it’s still quite clear the driver is unhappy.

At one point while testing, I got an interesting result. After closing the dashboard, the game continued running and rendering, but the eye images were completely in the wrong place in the HMD, as if the render/copy was going to the wrong offsets.

Attaching a new bug report log.

(Side note: things still go horribly wrong if a second monitor is connected, but that’s easily worked around and probably a separate bug anyway.)
nvidia-bug-report.log.gz (1.02 MB)