Arbitrary Crashes / Segfaults with RTX 3070 on current driver-455 on Ubuntu 20.04 kernel 5.4.0-58-generic

I don’t know whether this is of any value to the NVIDIA Linux driver developers, or whether this is the right place to post it, but I’ll give it a try. My system (as described in the title) crashes arbitrarily every other day, and each time I have to perform a hard reset. The syslog entries from the crash are shown below. I assume it’s driver-related, since it doesn’t happen with Windows on the same machine, and it doesn’t happen with another graphics card under Linux on the same machine.

Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Backtrace:
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55bb9c3ca52c]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 1: /lib/x86_64-linux-gnu/ (funlockfile+0x60) [0x7ffb955c941f]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x74e88) [0x7ffb94591e28]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x73861) [0x7ffb9458e9e1]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 4: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x76b35) [0x7ffb94595875]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 5: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x76cb9) [0x7ffb94595be9]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 6: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x6d681) [0x7ffb94582f41]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 7: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x71714) [0x7ffb9458ade4]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 8: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x6df47) [0x7ffb945840d7]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 9: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x739b4) [0x7ffb9458eb34]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 10: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x8ea04) [0x7ffb945c5394]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 11: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x6a29c) [0x7ffb9457c5bc]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 12: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x473c88) [0x7ffb94d8fb90]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 13: ? (?+0x0) [0x0]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Segmentation fault at address 0x7c
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: Fatal server error:
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Caught signal 11 (Segmentation fault). Server aborting
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: Please consult the The X.Org Foundation support
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: #011 at
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: for help.
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Please also check the log file at “/var/log/Xorg.1.log” for additional information.
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe gnome-shell[17402]: [17391:17592:1216/] GPU process isn’t usable. Goodbye.
Dec 16 14:46:22 devbox-mhe kernel: [20941.232659] traps: Chrome_IOThread[17592] trap int3 ip:560370eb37f1 sp:7f39bbb4a480 error:0 in chrome[56036e390000+7b80000]
Dec 16 14:46:22 devbox-mhe systemd[6231]: Condition check resulted in Notification regarding a crash report being skipped.
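
For anyone who wants to summarize a backtrace like the one above, here is a minimal sketch that pulls the frames out of the syslog lines and counts them per module path. The regular expression and the helper name `parse_frames` are my own assumptions based on the line layout in this log, not an official Xorg format specification.

```python
import re
from collections import Counter

# Matches frames such as:
#   (EE) 2: /usr/lib/.../nvidia/xorg/ (nvidiaAddDrawableHandler+0x74e88) [0x7ffb94591e28]
FRAME_RE = re.compile(
    r"\(EE\)\s+(?P<index>\d+):\s+(?P<path>\S+)\s+"
    r"\((?P<symbol>[^+)]+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def parse_frames(lines):
    """Return one dict per backtrace frame; non-frame lines are skipped."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "index": int(m["index"]),
                "path": m["path"],
                "symbol": m["symbol"],
                "offset": int(m["offset"], 16),
                "addr": int(m["addr"], 16),
            })
    return frames


# A few lines copied from the log above.
PREFIX = "Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: "
SAMPLE = [
    PREFIX + "(EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55bb9c3ca52c]",
    PREFIX + "(EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x74e88) [0x7ffb94591e28]",
    PREFIX + "(EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (nvidiaAddDrawableHandler+0x73861) [0x7ffb9458e9e1]",
    PREFIX + "(EE) 13: ? (?+0x0) [0x0]",
    PREFIX + "(EE) Segmentation fault at address 0x7c",
]

frames = parse_frames(SAMPLE)
frames_per_path = Counter(frame["path"] for frame in frames)
for path, count in frames_per_path.most_common():
    print(count, path)
```

In the full log, frames 2 through 12 all resolve to the same nvidia/xorg path with large offsets from `nvidiaAddDrawableHandler`, which probably means that name is just the nearest exported symbol rather than the function that actually crashed.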

Might be this:
Chrome/Electron involved?

@generix Not really. It happened once several days ago when Chrome might have been involved, but by now I’m pretty sure it’s the GPU acceleration of vscodium, as it just occurred several times within one hour:

kernel: [ 1669.873160] GpuWatchdog[51211]: segfault at 0 ip 00005563d5e0c049 sp 00007ff469de6440 error 6 in codium[5563d24bb000+5cbd000]    

I then disabled GPU acceleration in vscodium, and it has seemed stable since then. It’s quite possible that they messed something up in the way they use GPU acceleration, but I guess it still shouldn’t crash the whole machine.
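
In case it helps with reading the kernel line above, here is a small sketch that decodes it. The parsing regex and the helper name `decode_segfault` are my own assumptions. The bit meanings are the standard x86 page-fault error code bits (bit 0: page was present, bit 1: write access, bit 2: user mode).

```python
import re

# Layout of the kernel's x86 "segfault at ..." message as it appears above.
SEGFAULT_RE = re.compile(
    r"(?P<thread>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line; returns None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    error = int(m["error"], 16)  # the kernel prints the error code in hex
    return {
        "thread": m["thread"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "module": m["module"],
        # Offset of the faulting instruction inside the mapped binary.
        "module_offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(error & 0x1),
        "write_access": bool(error & 0x2),
        "user_mode": bool(error & 0x4),
    }


LINE = (
    "kernel: [ 1669.873160] GpuWatchdog[51211]: segfault at 0 "
    "ip 00005563d5e0c049 sp 00007ff469de6440 error 6 "
    "in codium[5563d24bb000+5cbd000]"
)

info = decode_segfault(LINE)
print(info["thread"], hex(info["module_offset"]), info)
```

Read this way, error 6 is a user-mode write to an unmapped page at address 0, which is consistent with a null-pointer write inside the codium binary rather than inside a driver library.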

Isn’t vscodium based on electron?
Of course it’s a bug in the nvidia driver that has to be fixed but so far, no reliable reproduction steps to trigger it have been found.

@generix you’re right! I didn’t know it uses Electron. Unfortunately I can’t reproduce it reliably either, but when using vscode it happens every few minutes as long as GPU acceleration is enabled (which is the default). I’ll post here if I find a reliable way of reproducing the issue. Thanks so far!

If you happen to find additional info, please post it to the thread I linked to since nvidia staff already have an eye on that.
What’s your monitor setup (number of monitors, resolution)?

@generix will do! 1 Screen, 4k

Which driver version, specifically, was that crash from? Can you please run sudo and attach the resulting log file here?


Driver Version 455.38 - Here you go: nvidia-bug-report.log.gz (351.7 KB)

Thank you. I filed internal bug number 3208650 to look into tracking this down.

@aplattner Thank you! It just crashed again btw - this time it could have been chrome - at least that was the only application I was interacting with. Relevant syslog entries show:

Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720770] NVRM: GPU at PCI:0000:2d:00: GPU-ca1d7679-5ad4-bc79-e4b7-94eb0c2bc757
Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720773] NVRM: GPU Board Serial Number:
Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720775] NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, 0000(0000) 00000000 00000000
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (II) event5  - ROCCAT ROCCAT Kone Pure Ultra: SYN_DROPPED event - some input events have been lost.
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (EE) client bug: timer event5 debounce: scheduled expiry is in the past (-26ms), your system is too slow
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (EE) client bug: timer event5 debounce short: scheduled expiry is in the past (-39ms), your system is too slow
Dec 18 11:23:49 devbox-mheinrich kernel: [ 8538.801615] NVRM: Xid (PCI:0000:2d:00): 16, pid=0, Head 00000000 Count 000eecb1
Dec 18 11:23:53 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA: Wait for channel idle timed out.
Dec 18 11:23:56 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x243f2, 0x0000cfe4, 0x000043a0)
Dec 18 11:24:03 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x243f2, 0x0000cfe4, 0x000043a0)
Dec 18 11:24:06 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x22a24, 0x0000cfe4, 0x000043a0)

I doubt the “your system is too slow” :-)

@aplattner Hi! Is there any news? In the meantime I have moved to driver version 460.32.03, but the problem still exists. I’m pretty sure it’s a driver issue and not a hardware issue, since it’s definitely independent of heat production or load. I can play FlightSim2020 for hours without problems under Windows, but under Linux, with only a browser and a text editor open, it sometimes crashes 4 times a day. Attached is an updated bug report. Thanks a lot in advance.

nvidia-bug-report.log.gz (385.7 KB)

The fix for the X server crash should be in 460.39. However, the crash was happening during error recovery triggered by the Xid errors you’re getting. Those (particularly Xid 62, “Internal micro-controller halt”) look like a GPU or general system stability problem so I’m kind of surprised you’re not seeing them in Windows.

Can you please try cleaning the PCIe contacts, reseating the GPU, and making sure the fans are clean and spinning correctly? The other thing that can cause this kind of problem is inadequate power supply, which can be triggered in Linux and not Windows (or vice-versa) just based on varying workloads from each OS. You could try switching around which external power connectors you have plugged into the GPU, making sure that the connectors are firmly attached.
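
To keep track of which Xid codes show up and when, a minimal sketch like the following can scan syslog lines for NVRM Xid entries. The regex and the helper name `extract_xids` are assumptions based on the lines posted in this thread. The only description included is the one given above for Xid 62; other codes are deliberately left unlabeled rather than guessed.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

# Only the description quoted in this thread; extend from NVIDIA's Xid docs.
KNOWN_XIDS = {62: "Internal micro-controller halt"}


def extract_xids(lines):
    """Return (pci_address, xid, description) for every Xid line found."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            xid = int(m["xid"])
            events.append((m["pci"], xid, KNOWN_XIDS.get(xid, "not labeled here")))
    return events


# Lines copied from the Dec 18 syslog excerpt in this thread.
SAMPLE = [
    "Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720775] NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, 0000(0000) 00000000 00000000",
    "Dec 18 11:23:49 devbox-mheinrich kernel: [ 8538.801615] NVRM: Xid (PCI:0000:2d:00): 16, pid=0, Head 00000000 Count 000eecb1",
    "Dec 18 11:23:53 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA: Wait for channel idle timed out.",
]

events = extract_xids(SAMPLE)
for pci, xid, description in events:
    print(pci, xid, description)
```

Running something like this over the whole syslog after each crash would show whether Xid 62 always comes first, which might help narrow down whether the X server crash is only a follow-up effect.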

Might this be a race condition triggered by threaded optimizations? Can those still be disabled by the env variable?

@aplattner Everything looks perfect - I didn’t overclock or do any tuning on the GPU at all. Under Linux the card is basically idling. I also use this as my home office PC, where I wouldn’t need that power at all. To show you what I mean, below is the level of usage I have when it crashes. And as I said: under Windows I can fly FS2020 for hours while the machine is basically a hot-air radiator, yet it is 100% stable. Not one crash in months, whereas I have something like 5 a day under Linux.

Wed Feb 10 10:18:18 2021
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 3070    Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P5    23W / 240W |    512MiB /  7976MiB |     18%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1453      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      5237      G   /usr/lib/xorg/Xorg                188MiB |
|    0   N/A  N/A      5468      G   /usr/bin/gnome-shell               80MiB |
|    0   N/A  N/A     16419      G   ...gAAAAAAAAA --shared-files      127MiB |

@generix sounds interesting - but I couldn’t find any current information on how to turn it off under Linux or whether this impacts stability :-?

Previously, this could be turned off by setting the env variable


@generix yeah I saw that somewhere but I was not aware which applications/libraries obey that env variable. Anyway - I set it in my ~/.profile. I’ll do anything that might help :-D Thanks!

All __GL variables manipulate the behaviour of the nvidia GLX driver for all applications started with it set.
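
As a minimal sketch of what "started with it set" means in practice: a variable placed in a process's environment is inherited by whatever that process launches. The thread does not name the variable, so `__GL_THREADED_OPTIMIZATIONS` here is an assumption (it is the name the NVIDIA driver README uses for the threaded optimizations toggle), and the helper `run_with_env` is hypothetical. The child below is just a Python one-liner that echoes the variable, standing in for a real GLX application.

```python
import os
import subprocess
import sys


def run_with_env(cmd, overrides):
    """Run cmd with the current environment plus the given overrides."""
    env = dict(os.environ)   # copy, so the parent environment stays untouched
    env.update(overrides)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Assumed variable name; see the note above.
GL_OVERRIDES = {"__GL_THREADED_OPTIMIZATIONS": "0"}

# Stand-in child process: prints the value it inherited.
child_cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('__GL_THREADED_OPTIMIZATIONS'))",
]

result = run_with_env(child_cmd, GL_OVERRIDES)
print(result.stdout.strip())
```

Setting the variable in ~/.profile has the same effect for everything started from the login session, while a wrapper like this limits it to a single application, which might make it easier to tell whether the setting changes anything.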

Have you observed any stability improvement from setting this parameter?