Arbitrary Crashes / Segfaults with RTX 3070 on current driver-455 on Ubuntu 20.04 kernel 5.4.0-58-generic

I don’t know if this is of any value to the NVIDIA Linux driver developers, or if this is the right place to post it, but I’ll give it a try. My system, as described in the title, crashes arbitrarily every other day, badly enough that I have to perform a hard reset. The syslog entries are below. I assume it’s driver-related, since it doesn’t happen under Windows on the same machine, and it doesn’t happen with another graphics card under Linux on the same machine.

Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Backtrace:
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55bb9c3ca52c]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7ffb955c941f]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x74e88) [0x7ffb94591e28]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x73861) [0x7ffb9458e9e1]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 4: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x76b35) [0x7ffb94595875]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 5: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x76cb9) [0x7ffb94595be9]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 6: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x6d681) [0x7ffb94582f41]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 7: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x71714) [0x7ffb9458ade4]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 8: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x6df47) [0x7ffb945840d7]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 9: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x739b4) [0x7ffb9458eb34]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 10: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x8ea04) [0x7ffb945c5394]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 11: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x6a29c) [0x7ffb9457c5bc]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 12: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x473c88) [0x7ffb94d8fb90]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 13: ? (?+0x0) [0x0]
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Segmentation fault at address 0x7c
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: Fatal server error:
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Caught signal 11 (Segmentation fault). Server aborting
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: Please consult the The X.Org Foundation support
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: #011 at http://wiki.x.org
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: for help.
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Please also check the log file at “/var/log/Xorg.1.log” for additional information.
Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE)
Dec 16 14:46:22 devbox-mhe gnome-shell[17402]: [17391:17592:1216/144622.397805:FATAL:gpu_data_manager_impl_private.cc(445)] GPU process isn’t usable. Goodbye.
Dec 16 14:46:22 devbox-mhe kernel: [20941.232659] traps: Chrome_IOThread[17592] trap int3 ip:560370eb37f1 sp:7f39bbb4a480 error:0 in chrome[56036e390000+7b80000]
Dec 16 14:46:22 devbox-mhe systemd[6231]: Condition check resulted in Notification regarding a crash report being skipped.
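For comparing backtraces like the one above across several crashes, a small script can count which module each frame belongs to. This is only a minimal sketch: the regular expression is an assumption based on the frame lines in this particular log, and `FRAME_RE` and `frames_by_module` are hypothetical names, not part of any Xorg or NVIDIA tool.

```python
import re
from collections import Counter

# Assumed frame format, taken from the log lines above:
#   (EE) 2: /path/to/module (symbol+0xoffset) [0xaddress]
FRAME_RE = re.compile(
    r"\(EE\) (?P<idx>\d+): (?P<module>\S+) "
    r"\((?P<symbol>[^+)]+)\+(?P<offset>0x[0-9a-f]+)\) "
    r"\[(?P<addr>0x[0-9a-f]+)\]"
)

def frames_by_module(lines):
    """Count backtrace frames per module path in the given log lines."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("module")] += 1
    return counts

# Three frames copied from the backtrace above, plus one non-frame line.
sample = [
    "Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55bb9c3ca52c]",
    "Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x74e88) [0x7ffb94591e28]",
    "Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x73861) [0x7ffb9458e9e1]",
    "Dec 16 14:46:22 devbox-mhe /usr/lib/gdm3/gdm-x-session[6334]: (EE) Segmentation fault at address 0x7c",
]

for module, count in frames_by_module(sample).most_common():
    print(count, module)
```

To run it on a whole log, pass an open file object (for example the syslog) instead of `sample`.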

Might be this:
https://forums.developer.nvidia.com/t/segmentation-fault-in-nvidiaadddrawablehandler/128324/10
Chrome/Electron involved?

@generix Not really. It happened once several days ago when Chrome might have been involved, but by now I’m pretty sure it’s the GPU acceleration of vscodium, as it just occurred several times within one hour:

kernel: [ 1669.873160] GpuWatchdog[51211]: segfault at 0 ip 00005563d5e0c049 sp 00007ff469de6440 error 6 in codium[5563d24bb000+5cbd000]    

I then disabled GPU acceleration in vscodium, and it has seemed stable since. It’s quite possible that they messed something up in the way they use GPU acceleration, but I’d guess it still shouldn’t crash the whole machine.

Isn’t vscodium based on Electron?
Of course it’s a bug in the nvidia driver that has to be fixed, but so far no reliable reproduction steps have been found.

@generix you’re right! I didn’t know it uses Electron. Unfortunately I can’t reproduce it reliably either, but when using vscode it happens every few minutes as long as GPU acceleration is enabled (which is the default). I’ll post here if I find a reliable way to reproduce the issue. Thanks so far!

If you happen to find additional info, please post it to the thread I linked to since nvidia staff already have an eye on that.
What’s your monitor setup (number of monitors, resolution)?

@generix will do! 1 Screen, 4k

Which driver version, specifically, was that crash from? Can you please run sudo nvidia-bug-report.sh and attach the resulting log file here?

@aplattner

Driver Version 455.38 - Here you go: nvidia-bug-report.log.gz (351.7 KB)

Thank you. I filed internal bug number 3208650 to look into tracking this down.

@aplattner Thank you! It just crashed again btw - this time it could have been chrome - at least that was the only application I was interacting with. Relevant syslog entries show:

Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720770] NVRM: GPU at PCI:0000:2d:00: GPU-ca1d7679-5ad4-bc79-e4b7-94eb0c2bc757
Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720773] NVRM: GPU Board Serial Number:
Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720775] NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, 0000(0000) 00000000 00000000
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (II) event5  - ROCCAT ROCCAT Kone Pure Ultra: SYN_DROPPED event - some input events have been lost.
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (EE) client bug: timer event5 debounce: scheduled expiry is in the past (-26ms), your system is too slow
Dec 18 11:23:48 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (EE) client bug: timer event5 debounce short: scheduled expiry is in the past (-39ms), your system is too slow
Dec 18 11:23:49 devbox-mheinrich kernel: [ 8538.801615] NVRM: Xid (PCI:0000:2d:00): 16, pid=0, Head 00000000 Count 000eecb1
Dec 18 11:23:53 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA: Wait for channel idle timed out.
Dec 18 11:23:56 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x243f2, 0x0000cfe4, 0x000043a0)
Dec 18 11:24:03 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x243f2, 0x0000cfe4, 0x000043a0)
Dec 18 11:24:06 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x22a24, 0x0000cfe4, 0x000043a0)

I doubt the “your system is too slow” part :-)
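In case it helps anyone collecting these, the Xid lines in the syslog excerpt above can be pulled out with a short script. This is a minimal sketch, assuming the `NVRM: Xid (PCI:...): <number>,` layout seen in this log; `XID_RE` and `extract_xids` are hypothetical names, not part of any NVIDIA tooling.

```python
import re

# Assumed line layout, taken from the kernel messages above:
#   NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

def extract_xids(lines):
    """Return (pci_address, xid_number) tuples for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("xid"))))
    return events

# Lines copied from the excerpt above; the last one is not an Xid line.
sample = [
    "Dec 18 11:23:37 devbox-mheinrich kernel: [ 8527.720775] NVRM: Xid (PCI:0000:2d:00): 62, pid=1439, 0000(0000) 00000000 00000000",
    "Dec 18 11:23:49 devbox-mheinrich kernel: [ 8538.801615] NVRM: Xid (PCI:0000:2d:00): 16, pid=0, Head 00000000 Count 000eecb1",
    "Dec 18 11:23:53 devbox-mheinrich /usr/lib/gdm3/gdm-x-session[4189]: (WW) NVIDIA: Wait for channel idle timed out.",
]

print(extract_xids(sample))
# prints [('PCI:0000:2d:00', 62), ('PCI:0000:2d:00', 16)]
```

Listing the Xid numbers per crash makes it easier to see whether the same codes keep recurring before posting them to the thread.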
