Nvidia driver segfaults on Steam restart. Occurs with driver versions >= 435.21 and multiscreen configuration

elven.thief · December 19, 2019, 4:08am

I am using a GeForce GTX 1060 6GB in a 3 monitor configuration as follows:

Each monitor hosts a separate screen on the same X server (they are addressable as DISPLAY=:0.0 DISPLAY=:0.1 DISPLAY=:0.2)

:0.0 is the Left display with a resolution of 1920x1200 - HDMI-0 connection
:0.1 is the Center display with a resolution of 1920x1080 - HDMI-1 connection
:0.2 is the Right display with a resolution of 1080x1920 (rotated to portrait mode) - DV-0 connection

I am not using Twinview or Xinerama. Each display has its own separate desktop instance with multiple virtual desktops per monitor (Fluxbox as WM).
This means that I can resize each screen's resolution individually, but can't drag windows across the screens.
Fullscreen gaming will only occupy 1 monitor instead of all 3.

After an nvidia driver upgrade earlier this year, I started noticing sporadic crashes after running the Linux Steam client.

I’ve narrowed it down to the following reproduction steps:

1. Start Steam on the :0.0 display (1920x1200). I’m currently letting it auto-login, so I let it get to the main store window.
2. Exit Steam.
3. Start the Steam client a second time on the same display. Before it successfully renders a loading or login screen, the X.org crash below occurs in nvidia_drv.so:

[ 320.061] (EE)
[ 320.062] (EE) Backtrace:
[ 320.062] (EE) 0: /usr/bin/X (xorg_backtrace+0x4d) [0x55ed7c346a9d]
[ 320.062] (EE) 1: /usr/bin/X (0x55ed7c1a2000+0x1a8755) [0x55ed7c34a755]
[ 320.062] (EE) 2: /lib64/libpthread.so.0 (0x7f1b19d26000+0x14500) [0x7f1b19d3a500]
[ 320.062] (EE) 3: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7f1b1830a000+0x4c0c2c) [0x7f1b187cac2c]
[ 320.062] (EE)
[ 320.062] (EE) Segmentation fault at address 0x5df9c7dd
[ 320.062] (EE)
Fatal server error:
[ 320.062] (EE) Caught signal 11 (Segmentation fault). Server aborting
[ 320.062] (EE)
[ 320.062] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[ 320.062] (EE) Please also check the log file at “/var/log/Xorg.0.log” for additional information.
[ 320.062] (EE)
[ 320.062] (EE)
[ 320.062] (EE) Backtrace:
[ 320.062] (EE) 0: /usr/bin/X (xorg_backtrace+0x4d) [0x55ed7c346a9d]
[ 320.062] (EE) 1: /usr/bin/X (0x55ed7c1a2000+0x1a8755) [0x55ed7c34a755]
[ 320.062] (EE) 2: /lib64/libpthread.so.0 (0x7f1b19d26000+0x14500) [0x7f1b19d3a500]
[ 320.062] (EE) 3: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7f1b1830a000+0x4c0c2c) [0x7f1b187cac2c]
[ 320.062] (EE)
[ 320.062] (EE) Bus error at address 0x0
[ 320.062] (EE)
FatalError re-entered, aborting
[ 320.062] (EE) Caught signal 7 (Bus error). Server aborting
[ 320.062] (EE)

The 1st error (segfault) always appears in the logs, and always at the same offset into nvidia_drv.so (0x4c0c2c for the 440.44 driver).
The 2nd error (bus error) does not always appear in the logs; I suspect whether it occurs depends on what was left in memory when I restarted Steam.
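
To compare crashes across runs, the module offset can be pulled out of each backtrace frame. The following is only a sketch in Python: the helper names (`parse_frames`, `driver_offsets`) are made up for illustration, and the regular expression assumes the frame layout shown in the log above, where the parentheses hold a load base plus an offset. Frames printed with a symbol name instead of a load base are skipped.

```python
import re

# Matches frames such as:
#   3: /path/to/nvidia_drv.so (0x7f1b1830a000+0x4c0c2c) [0x7f1b187cac2c]
# Frames of the form "(symbol+0x4d)" do not match and are ignored.
FRAME_RE = re.compile(
    r"(?P<index>\d+): (?P<module>\S+) "
    r"\((?P<base>0x[0-9a-f]+)\+(?P<offset>0x[0-9a-f]+)\) "
    r"\[(?P<address>0x[0-9a-f]+)\]"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame that carries a load base."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        base = int(match["base"], 16)
        offset = int(match["offset"], 16)
        address = int(match["address"], 16)
        frames.append({
            "index": int(match["index"]),
            "module": match["module"],
            "offset": offset,
            # Sanity check: base + offset should equal the absolute address.
            "consistent": base + offset == address,
        })
    return frames


def driver_offsets(log_text, module_suffix="nvidia_drv.so"):
    """Return the distinct offsets at which the given module appears."""
    return sorted({
        frame["offset"]
        for frame in parse_frames(log_text)
        if frame["module"].endswith(module_suffix)
    })
```

Feeding the contents of the Xorg log to `driver_offsets` after each crash would show whether the faulting offset really stays the same for a given driver build.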

When the driver segfaults, I am able to ssh into the box and restart X, so it’s not causing a kernel panic.
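
The reproduction steps above could be scripted roughly as follows, for example from an ssh session. This is a sketch under several assumptions: that `steam` is on the PATH, that `steam -shutdown` cleanly exits a running client, that the wait times are long enough for auto-login, and that the log path matches the one printed in the crash output. The function names are hypothetical.

```python
import os
import subprocess
import time

XORG_LOG = "/var/log/Xorg.0.log"  # path as printed in the crash output


def display_env(display, base_env=None):
    """Copy the environment and point DISPLAY at one X screen, e.g. ':0.0'."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def log_has_segfault(log_text):
    """True if the X server logged the signal 11 abort seen in this report."""
    return "Caught signal 11" in log_text


def start_and_exit_steam(display, settle_seconds=60):
    """Steps 1 and 2: start Steam, let it reach the store page, then exit."""
    env = display_env(display)
    subprocess.Popen(["steam"], env=env)
    time.sleep(settle_seconds)  # assumed long enough for auto-login
    subprocess.run(["steam", "-shutdown"], env=env, check=False)
    time.sleep(settle_seconds)  # give the client time to exit


def reproduce(display=":0.0", wait_seconds=30):
    """Step 3: start Steam a second time and check the X log for the crash."""
    start_and_exit_steam(display)
    subprocess.Popen(["steam"], env=display_env(display))
    time.sleep(wait_seconds)
    with open(XORG_LOG, errors="replace") as log_file:
        return log_has_segfault(log_file.read())
```

Calling `reproduce(":0.0")` and then `reproduce(":0.1")` would exercise the failing and non-failing screens in turn. If a display manager restarts X after the crash, the relevant log may already have been rotated, so the check may need to read the previous log file instead.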

Additional system and software details:

X.Org X Server 1.20.5
X Protocol Version 11, Revision 0
Current Operating System: Linux 5.4.3-gentoo #1 SMP PREEMPT Sat Dec 14 16:44:02 CST 2019 x86_64

I’ve narrowed down the conditions to trigger this crash as follows:

*Condition 1:
The driver version must be newer than the 430.xx series. I’ve reproduced the crash on 435.21 and am currently running 440.44. I just downgraded to 430.64, tested it, and cannot trigger this crash. I’ve been running this box for over 2 years and have kept the nvidia driver up to date throughout. I never experienced the segfault until I upgraded to the 435 series a few months ago.

*Condition 2:
This segfault only occurs on the 1920x1200 display. I tested restarting Steam on the other 2 displays and could not reproduce the crash.

*Condition 3:
This segfault only occurs when there are multiple monitors active. I tested restarting Steam on the 1920x1200 display as a single monitor and was not able to reproduce the crash.

*Condition 4:
This segfault only occurs on the 1920x1200 monitor if it’s in this native resolution. I can “downgrade” it to 1920x1080 and the crash cannot be reproduced on the same display.

Things that don’t appear to matter:

* Rearranging the monitor layout has no effect. Steam restarts will always trigger an X crash on the 1920x1200 display no matter what monitor is considered at 0 0 in virtual space.

* Switching HDMI connectors has no effect. I can trigger the crash on the 1920x1200 display even if it's on HDMI-1.

* Going from 3 to 2 monitors doesn't make a difference. A Steam restart on the 1920x1200 display will still trigger the crash, and it can't be reproduced on either of the other displays.

* Kernel or GCC version doesn't matter. I know Gentoo's reputation and I don't overconfigure optimizations. I've experienced the crash on multiple different kernels in the 5.x.x series. I've also moved from GCC8 to GCC9 and rebuilt X.org and related libraries. The crashes only appear to show up in the nvidia drivers after the 430 series, regardless of kernel or GCC version. I haven't had any other software crashes occurring in drivers or applications on this box.

Further notes:

If I start Steam on the “problem” monitor and exit it without restarting, the nvidia driver will eventually crash later.
This usually occurs when I’m opening or closing a YouTube video in Chrome on the middle monitor. This may take minutes or hours, but it eventually does occur.
I haven’t tested this behavior as thoroughly as the reproduction above, but it strongly suggests to me that something Steam is doing is not getting properly reset in the driver when it’s closed, and some memory is getting leaked or clobbered.

Hopefully this is enough information to reproduce or narrow down a regression. If I had access to the driver code, I would bisect between the 430 and 435 branches to see if there’s anything that might have broken due to bad assumptions about monitor resolutions in multimonitor configurations.

My gut instinct is that there is likely some kind of memory corruption in the driver occurring when releasing Steam’s allocated resources, but only if the monitor it’s on has a “weird” resolution.

I’ve preemptively attached a bug report as well. It was produced with the 440.44 driver from an ssh session.
nvidia-bug-report.log.gz (47.7 KB)

pemt512 · March 17, 2020, 9:09pm

Hi elven.thief and Nvidia,

I suffer from what seems to be the exact same problem. My configuration uses two separate X screens on a GeForce 2060. The primary X screen has a resolution of 1920x1200 and the second X screen uses 1920x1080.

System info:

XOrg X Server 1.20.5
X Protocol Version 11, Revision 0
Ubuntu 18.04.4 LTS
kernel 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64
Nvidia driver 440.64 from the “Graphics Drivers” team PPA (Proprietary GPU Drivers)

Log from last crash triggered by exiting and starting Steam on primary X screen:

[212266.755] (EE)
[212266.755] (EE) Backtrace:
[212266.755] (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x139) [0x561f903647d9]
[212266.755] (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x50) [0x7f760f1518df]
[212266.756] (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x4569ec) [0x7f760c610cd8]
[212266.756] (EE)
[212266.756] (EE) Segmentation fault at address 0xad
[212266.756] (EE)
Fatal server error:
[212266.756] (EE) Caught signal 11 (Segmentation fault). Server aborting
[212266.756] (EE)
[212266.756] (EE)
Please consult the The XOrg Foundation support
at http://wiki.x.org
for help.
[212266.756] (EE) Please also check the log file at “/var/log/Xorg.1.log” for additional information.
[212266.756] (EE)

Best regards

ea.rose.0 · June 5, 2020, 11:06am

Hi,
GTX1660 6GB
440.82 (-r3: Gentoo)
5.4.38-gentoo
X Server 1.20.7
Zen2

Exactly the same issue with dual monitor setup. But more info to add:
:0.0 = 1920x1080 (default res).
:0.1 = 1280x1024 (default res).

Once Steam has been started and exited, any of a) restarting Steam, b) ctrl-alt-+, or c) suspend to RAM will cause the nvidia_drv.so / libpthread.so segfault. Running an OpenGL game (UT2004 - DVD) doesn’t cause any issues.
This only occurs on the monitor with the higher resolution (higher in either axis; it does not need to be higher in both).
If I move :0.0 to 1280x720, either during an active X session or by setting it as the main resolution to run at, the problem doesn’t manifest on this monitor; it now manifests on the :0.1 display instead. So it is resolution related, and not ViewPort/configuration related.

Setting the monitors to:
:0.0 = 1600x900
:0.1 = 1280x1024
will allow the issue to manifest on either monitor, as they both have a higher resolution in one direction.
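
One possible reading of this observation, written out as a small Python sketch: a screen is considered at risk if it is wider or taller than at least one other screen. This rule is an interpretation of the three configurations described in this post, not anything confirmed about the driver, and the function name is hypothetical.

```python
def screens_at_risk(resolutions):
    """Return the screens that exceed some other screen in width or height.

    `resolutions` maps a screen name such as ':0.0' to a (width, height)
    tuple. The rule is a guess fitted to the configurations in this post.
    """
    at_risk = []
    for name, (width, height) in resolutions.items():
        others = [res for other, res in resolutions.items() if other != name]
        if any(width > w or height > h for w, h in others):
            at_risk.append(name)
    return sorted(at_risk)
```

With the default resolutions this flags only :0.0, with :0.0 at 1280x720 it flags only :0.1, and with 1600x900 next to 1280x1024 it flags both, matching the behavior described here. Applied to the three-screen layout from the first post it would flag all three screens, although only the 1920x1200 one was reported to crash, so the rule is at best incomplete.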

I’ve not posted any logs as it’s nothing different to the above.

elven.thief · June 5, 2020, 10:49pm

So at some point, I was able to recreate something similar to your b) ctrl-alt-+ scenario.
After starting and exiting Steam, just about every xrandr command to resize the screen would immediately trigger it.

I had built debug versions of X and xrandr to attempt to isolate the last point that X touches before it crashes in the driver, but I never got far enough into X.org internals. I might dive down that rabbit hole again this weekend. I figured I couldn’t trust the stack trace provided because the crash was likely from reading bad memory at that point.

I was able to test multiple driver versions in my original post, so I believe a feature added in 435 (maybe PRIME rendering?) is the culprit here. I’m happy/sad that I’m not the only person who has experienced this.
Whatever Steam is doing is leaving memory in the driver corrupted. If the crash isn’t triggered directly by my own action, randomly browsing in Chrome on the other screen/X server would eventually trigger the same crash during a render action.

Perhaps we can finally get someone from nVidia to look into it?

ea.rose.0 · June 6, 2020, 10:08am

I did request an escalation for this via their support chat yesterday, so we may get a developer to take a look (fingers crossed).

I don’t run hybrid graphics on my machine. Just a ‘traditional’ CPU + a single GTX card. So I’m not sure about PRIME, but I had come to the same conclusion that Steam is doing something ‘naughty’ and the nvidia drivers are letting it / not managing things properly.

I’m happy I’m not the only one suffering this - finally, someone else who uses a dual-head setup and not twinview… On a more serious note though, this is a much better situation for debugging: multiple users and now a consistent/reproducible fault.

elven.thief · June 28, 2020, 7:01am

I just updated to 450.51 and this particular crash appears to have disappeared. There are numerous bug fixes in the release notes, so I’m hoping that whatever the root cause was, it was covered by one of those fixes. Linux x64 (AMD64/EM64T) Display Driver | 450.51 | Linux 64-bit | NVIDIA
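
For anyone checking whether an installed driver falls inside the range reported in this thread, a minimal Python sketch follows. The bounds only reflect what posters here tested (no crash on 430.64, crashes on 435.21, 440.44, 440.64 and 440.82, no crash so far on 450.51). Versions between those points were not tested, and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a driver version string such as '440.44' into (440, 44)."""
    return tuple(int(part) for part in text.split("."))


# Bounds taken from the reports in this thread, not from NVIDIA.
FIRST_REPORTED_BAD = parse_version("435.21")
FIRST_REPORTED_GOOD = parse_version("450.51")


def in_reported_range(version):
    """True if the version lies between the first bad and first good report."""
    return FIRST_REPORTED_BAD <= parse_version(version) < FIRST_REPORTED_GOOD
```

Comparing integer tuples rather than strings or floats keeps the ordering numeric per component, so a hypothetical 440.100 would sort after 440.82 instead of before it.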
