Nvidia driver 525 crash on RTX 4000

Hello,

I run an app that uses the GStreamer library to decode several video streams with the avdec_h265 and glimagesink plugins. I encounter a crash in the Nvidia driver that brings down the X server. I searched online for a solution, but nothing I found has helped.

I use:
OS: Ubuntu 20.04
GPU: TU104GL [Quadro RTX 4000]
Nvidia driver version: 525.78.01
OpenGL version: 4.6.0
GStreamer version: 1.16.3

The relevant crash log:

Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) Backtrace:
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x558b8c18aecc]
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f41ba1ed420]
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaUnlock+0x7ea74) [0x7f41b91c22a4]
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaUnlock+0x69733) [0x7f41b91acf63]
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) 4: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaUnlock+0x3ed288) [0x7f41b9530ab8]
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) Segmentation fault at address 0x0
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: Fatal server error:
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) Caught signal 11 (Segmentation fault). Server aborting
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: Please consult the The X.Org Foundation support
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]:          at http://wiki.x.org
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]:  for help.
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) Please also check the log file at "/home/ub40/.local/share/xorg/Xorg.1.log" for additional information.
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE)
Feb 21 09:15:55 ub40-System-Product-Name /usr/lib/gdm3/gdm-x-session[1832]: (EE) Server terminated with error (1). Closing log file.
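
For anyone comparing backtraces across driver versions, here is a minimal sketch that pulls the stack frames out of log lines like the ones above. The function names (`parse_backtrace`, `frames_per_module`) are made up for illustration, and the regular expression only assumes the frame layout visible in this log (`(EE) N: module (symbol+offset) [address]`); other Xorg builds may format frames differently.

```python
import os
import re
from collections import Counter

# Matches Xorg backtrace frames such as:
#   (EE) 2: /path/nvidia_drv.so (nvidiaUnlock+0x7ea74) [0x7f41b91c22a4]
# The syslog prefix in front of "(EE)" is ignored because we use search().
FRAME_RE = re.compile(
    r"\(EE\)\s+(?P<index>\d+):\s+(?P<module>\S+)\s+"
    r"\((?P<symbol>[^+)]+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_backtrace(lines):
    """Return one dict per backtrace frame found in the given log lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line (e.g. "(EE) Segmentation fault ...")
        frames.append({
            "index": int(match["index"]),
            "module": match["module"],
            "symbol": match["symbol"],
            "offset": int(match["offset"], 16),
            "address": int(match["address"], 16),
        })
    return frames


def frames_per_module(frames):
    """Count frames per shared object, keyed by file name only."""
    return Counter(os.path.basename(frame["module"]) for frame in frames)


if __name__ == "__main__":
    # Two frames copied from the log above (syslog prefix shortened).
    sample = [
        "(EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x558b8c18aecc]",
        "(EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so "
        "(nvidiaUnlock+0x7ea74) [0x7f41b91c22a4]",
    ]
    for frame in parse_backtrace(sample):
        print(frame["index"], frame["module"], frame["symbol"], hex(frame["offset"]))
```

Note that the very large offsets from `nvidiaUnlock` (for example `+0x3ed288`) suggest the name is probably just the nearest exported symbol in `nvidia_drv.so`, not the function that actually crashed, so the module and raw offset are likely the more useful fields to compare between runs.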

nvidia-bug-report.log.gz (338.0 KB)

Have you already checked for a driver regression by downgrading the driver?

Before I moved to the RTX 4000 I used a GeForce RTX 2070, and I had this crash with various driver versions.
I also tried the RTX 4000 with 510.108.03 and got the same crash.

Also, yesterday I switched from gdm3 to xfce+lightdm, and the crash hasn’t reproduced since.

That’s really odd: a GStreamer pipeline that doesn’t seem to use hardware decoding or encoding crashes the Nvidia DDX depending on the window manager. Do you use any special gst plugins that could make a difference?

For testing, we simplified the pipeline: it now uses only avdec_h265 and glimagesink, and we removed all the callbacks and special plugins.
We run it without root privileges.
One of our next tests is to reproduce it with gst-launch instead of our own software.
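
A rough sketch of what that gst-launch test could look like, in case it is useful to others trying to reproduce this. It only builds the argument list for `gst-launch-1.0` with one `filesrc ! h265parse ! avdec_h265 ! glimagesink` chain per stream. The input files and their names are assumptions: the thread does not say where the streams come from, so this assumes raw H.265 elementary-stream files with no spaces in their paths. `build_gst_launch_args` is a hypothetical helper name.

```python
import shlex


def build_gst_launch_args(paths):
    """Build a gst-launch-1.0 argument list with one decode chain per file.

    Assumption: each path is a raw H.265 elementary stream, so h265parse
    can sit directly after filesrc. A containerised or network source
    would need a different front end (demuxer, depayloader, ...).
    """
    if not paths:
        raise ValueError("at least one input path is required")

    args = ["gst-launch-1.0", "-v"]
    for path in paths:
        # gst-launch accepts several unlinked chains in one invocation;
        # each chain opens its own glimagesink window.
        args += [
            "filesrc", f"location={path}",
            "!", "h265parse",
            "!", "avdec_h265",
            "!", "glimagesink",
        ]
    return args


if __name__ == "__main__":
    # Placeholder file names, purely for illustration.
    args = build_gst_launch_args(["stream0.h265", "stream1.h265"])
    # shlex.join gives a copy-pasteable shell command line.
    print(shlex.join(args))
```

Passing the list to `subprocess.run(args)` would launch the pipelines directly. Keeping the software decoder and glimagesink matches the simplified pipeline described above, so a crash here would point away from the application code.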

Several updates:

  1. We switched the GPU to an NVIDIA RTX A4500, and the same crash reproduced on it. Important note: it reproduces less frequently on the RTX A4500 than on the GeForce RTX 2070.
  2. We saw “permission denied” messages from nvidia-drm in the logs; we are not sure whether they are related.
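
One way to check whether those messages are related is to see how close in time they are to the crash. Below is a minimal sketch under stated assumptions: the log lines use the same syslog-style prefix as the crash log above (`Feb 21 09:15:55 host ...`), the match is a plain case-insensitive search for "nvidia-drm" and "permission denied" on the same line, and the year has to be supplied by the caller because this timestamp format omits it. Both function names are hypothetical.

```python
import re
from datetime import datetime

# Syslog-style prefix as seen in the crash log: "Feb 21 09:15:55 host ..."
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<message>.*)$"
)


def find_nvidia_drm_denials(lines):
    """Return (timestamp, line) pairs for nvidia-drm permission-denied lines.

    The timestamp is None when the line lacks the expected syslog prefix.
    """
    hits = []
    for line in lines:
        lowered = line.lower()
        if "nvidia-drm" in lowered and "permission denied" in lowered:
            match = SYSLOG_RE.match(line)
            timestamp = match["timestamp"] if match else None
            hits.append((timestamp, line.rstrip("\n")))
    return hits


def seconds_between(earlier, later, year):
    """Seconds from one syslog timestamp to another within the given year."""
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {earlier}", fmt)
    end = datetime.strptime(f"{year} {later}", fmt)
    return (end - start).total_seconds()
```

Feeding it the journal or syslog lines and comparing each hit against the crash timestamp (`Feb 21 09:15:55` in the log above) with `seconds_between` would show whether the denials cluster right before the segfault or appear all the time regardless.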