Kernel Oops on boot, 5.9.13 with 455.45.01

Hi

It seems I’ve been getting an oops on boot for a while without noticing. It doesn’t cause any problem that I can see.

Here’s the oops:

[   20.673580] RIP: 0010:nv_drm_master_set+0x22/0x30 [nvidia_drm]
[   20.673582] Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 47 48 48 8b 78 20 48 8b 05 0c 5d 00 00 48 8b 40 28 e8 e3 07 77 f6 84 c0 74 01 c3 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 8c
[   20.673583] RSP: 0018:ffffa45104f53c00 EFLAGS: 00010246
[   20.673584] RAX: 0000000000000000 RBX: ffff8f182737ba00 RCX: 0000000000000008
[   20.673585] RDX: ffffffffc264eed8 RSI: 0000000000000292 RDI: 0000000000000292
[   20.673586] RBP: ffff8f1875f2ecc0 R08: 0000000000000008 R09: ffffa45104f53be8
[   20.673586] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f188ab63000
[   20.673587] R13: 0000000000000000 R14: ffff8f188ab63000 R15: 000000009c416ba8
[   20.673588] FS:  00007f72082dcb80(0000) GS:ffff8f189ed00000(0000) knlGS:0000000000000000
[   20.673589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.673590] CR2: 000055bdfd713e08 CR3: 000000038f0b2006 CR4: 00000000001706e0
[   20.673591] Call Trace:
[   20.673609]  drm_new_set_master+0x7a/0x100 [drm]
[   20.673622]  drm_master_open+0x68/0x90 [drm]
[   20.673632]  drm_open+0xf8/0x250 [drm]
[   20.673645]  drm_stub_open+0xab/0x130 [drm]
[   20.673649]  chrdev_open+0xdd/0x210
[   20.673651]  ? cdev_device_add+0x90/0x90
[   20.673653]  do_dentry_open+0x14b/0x360
[   20.673656]  path_openat+0xa70/0xfb0
[   20.673659]  ? vsnprintf+0x387/0x4e0
[   20.673661]  ? page_counter_uncharge+0x36/0x50
[   20.673663]  do_filp_open+0x75/0x100
[   20.673665]  ? __check_object_size+0x136/0x150
[   20.673667]  ? __alloc_fd+0x44/0x150
[   20.673669]  do_sys_openat2+0x7b/0x130
[   20.673671]  __x64_sys_openat+0x46/0x70
[   20.673673]  do_syscall_64+0x33/0x40
[   20.673676]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And here’s the bug report:
nvidia-bug-report.log.gz (1.2 MB)

Regards
elFarto

Same issue with 5.9.15 and 460.27.04.

I’m observing the same kernel oops on Fedora 33 with 5.9.16 and 460.27.04. It’s been happening for a while, but only when I enable "options nvidia-drm modeset=1". If I don’t enable this option, my monitor has trouble syncing after a reboot (the display remains blank). Since the kernel oops doesn’t appear to cause any issues, as @elFarto mentioned, I’ve chosen to leave the module option enabled to avoid that display issue. In my setup, modesetting doesn’t appear to be required for a nice graphical boot on first boot, because UEFI sets the resolution at boot and GRUB leaves it alone.

The Xorg modesetting DDX seems to be fighting with the Nvidia DDX for DRM nodes

[   463.657] (II) modeset(G0): using drv /dev/dri/card0

but

[   463.658] (II) Applying OutputClass "nvidia" options to /dev/dri/card0

NVIDIA appears to be adding a lot of DRM-related functionality to the kernel driver (so it now looks a lot like a real DRM driver to userspace), and as a result the modesetting DDX gets further into its initialization than it used to.

In my case, pointing the modesetting DDX to only look at non-NVIDIA devices (the kmsdev option) and ensuring all the NVIDIA kernel modules are fully loaded (instead of having them load on demand) before starting X worked around this. The latter avoids a weird race condition where the NVIDIA driver is only partially loaded and the check in config/udev.c (xorg/xserver on GitLab) fails, which causes modesetting to load regardless (it’s hardcoded in xf86platformAddDevice).

The WARN_ON itself can still be triggered by simply opening an NVIDIA primary DRM node though, e.g. with tests/drmdevice.c (Mesa/drm on GitLab), when some other DRM client like X or fbcon already has master. Because of the patch "[v2,1/2] drm: vmwgfx: remove drm_driver::master_set() return typ" (Patchwork), the NVIDIA kernel driver can no longer use its internal grabOwnership to report whether DRM master can be taken.
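
For anyone who wants to check this without building drmdevice, here is a minimal Python sketch that just opens and closes a primary node. It assumes the node is /dev/dri/card0 (adjust for your system) and uses O_RDWR | O_CLOEXEC, which on x86-64 Linux is 524290, the same oflag that shows up in the gdb backtrace further down the thread. The function name probe_primary_node is made up for illustration. Whether the open actually trips the WARN_ON depends on the driver and kernel versions discussed here, so treat it as a probe, not a guaranteed reproducer.

```python
import errno
import os

# O_RDWR | O_CLOEXEC; on x86-64 Linux this evaluates to 524290.
OPEN_FLAGS = os.O_RDWR | os.O_CLOEXEC


def probe_primary_node(path="/dev/dri/card0"):
    """Open and immediately close a DRM primary node.

    Returns "opened" on success, or the symbolic errno name
    (for example "ENOENT" or "EACCES") if the open fails.
    The default path is an assumption; pick the node that
    belongs to the NVIDIA GPU on your machine.
    """
    try:
        fd = os.open(path, OPEN_FLAGS)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "opened"
```

Run print(probe_primary_node()) while Xorg is up with the NVIDIA DDX, then look at dmesg for a new nv_drm_master_set trace.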


@osmoticum can you share a little more about what you changed? Do you ever see the oops related to drm_new_set_master after the changes you made? From what I understand, there are some bugs, but you have some tricks to work around the issue or make it occur less often? I know it’s harmless, but it still triggers alerts for people running ABRT, for example.

can you share a little more about what you changed?

To load the kernel modules early with systemd, I added a conf file containing

nvidia
nvidia_modeset
nvidia_drm

to modules-load.d. For me, that stopped the modesetting DDX from trying to use NVIDIA GPUs, and stopped Xorg itself from triggering the oops.
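
If you want to confirm the three modules really are fully loaded before X starts, something like the following Python sketch can help. It parses the text of /proc/modules and reports which of the required modules are not in the "Live" state. The function name missing_modules is just illustrative, and the column layout (name, size, refcount, dependencies, state, address) is the usual /proc/modules format.

```python
REQUIRED = ("nvidia", "nvidia_modeset", "nvidia_drm")


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required modules that are not listed as 'Live'."""
    live = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Columns: name size refcount deps state address [taint]
        if len(fields) >= 5 and fields[4] == "Live":
            live.add(fields[0])
    return [name for name in required if name not in live]
```

Usage would be reading /proc/modules and passing its contents, e.g. missing_modules(open("/proc/modules").read()). An empty list means all three modules are loaded and live.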

Do you ever see the oops related to drm_new_set_master after the changes you made?

Whenever something pokes at an NVIDIA DRM primary node while Xorg is running with the NVIDIA DDX, such as drmdevice as previously mentioned. eglinfo (from https://archive.mesa3d.org/demos/mesa-demos-8.4.0.tar.bz2) is another example:

$ udevadm info -a /dev/dri/card0 | grep nvidia
    DRIVERS=="nvidia"
Breakpoint 1, __libc_open64 (file=0x7fffffffb560 "/dev/dri/card0", oflag=524290) at ../sysdeps/unix/sysv/linux/open64.c:37
37	in ../sysdeps/unix/sysv/linux/open64.c
(gdb) bt
#0  __libc_open64 (file=0x7fffffffb560 "/dev/dri/card0", oflag=524290) at ../sysdeps/unix/sysv/linux/open64.c:37
#1  0x00007ffff66395e9 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#2  0x00007ffff666a84a in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#3  0x00007ffff667c7ca in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#4  0x00007ffff6677ae8 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#5  0x00007ffff6677190 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#6  0x00007ffff6937d6a in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#7  0x00007ffff6927140 in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#8  0x00007ffff692742d in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#9  0x00007ffff693a561 in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#10 0x00005555555557e9 in doOneDisplay (d=0x5555555af320, name=name@entry=0x5555555561f8 "Device platform") at eglinfo.c:185
#11 0x0000555555555a10 in main (argc=<optimized out>, argv=<optimized out>) at eglinfo.c:238
(gdb) print (int) getpid()
$1 = 35290
(gdb) continue
Continuing.

results in

[  685.515248] CPU: 0 PID: 35290 Comm: eglinfo Tainted: G S         OE     5.10.14 #1
(...)
[  685.515251] RIP: 0010:nv_drm_master_set+0x22/0x30 [nvidia_drm]
[  685.515253] Code: 0f 1f 84 00 00 00 00 00 55 48 8b 47 48 48 89 e5 48 8b 78 20 48 8b 05 cd 5b 00 00 48 8b 40 28 e8 04 56 ca c8 84 c0 74 02 5d c3 <0f> 0b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 80 3d f1 a4 00 00 00 55
[  685.515254] RSP: 0018:ffffb08f0dc0fb90 EFLAGS: 00010246
[  685.515255] RAX: 0000000000000000 RBX: ffff95057719f400 RCX: 0000000000000008
[  685.515255] RDX: ffffffffc251ce18 RSI: 0000000000000296 RDI: ffffffffc251ce10
[  685.515256] RBP: ffffb08f0dc0fb90 R08: 0000000000000008 R09: ffffb08f0dc0fb78
[  685.515256] R10: 0000000000000000 R11: ffff9509b968fd9a R12: ffff9505ecc3cf00
[  685.515257] R13: ffff9502befd8800 R14: 0000000000000000 R15: ffff9502befd8800
[  685.515258] FS:  00007ffff6c3bb80(0000) GS:ffff950a5f800000(0000) knlGS:0000000000000000
[  685.515259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  685.515259] CR2: 00007f34d30e4000 CR3: 0000000ef1fa6005 CR4: 00000000003706f0
[  685.515260] Call Trace:
[  685.515267]  drm_new_set_master+0x79/0x100
[  685.515268]  drm_master_open+0x69/0x90
[  685.515269]  drm_open+0xf7/0x2a0
[  685.515274]  ? radix_tree_lookup+0xd/0x10
[  685.515276]  drm_stub_open+0xb5/0x130
[  685.515281]  chrdev_open+0xae/0x200
[  685.515282]  ? cdev_device_add+0x90/0x90
[  685.515285]  do_dentry_open+0x155/0x370
[  685.515287]  vfs_open+0x28/0x30
[  685.515289]  do_open+0x225/0x310
[  685.515290]  path_openat+0xdb/0x1a0
[  685.515291]  do_filp_open+0x78/0x100
[  685.515292]  ? __check_object_size+0x17/0x20
[  685.515295]  ? strncpy_from_user+0x8c/0x1a0
[  685.515297]  ? __alloc_fd+0x3a/0x150
[  685.515298]  do_sys_openat2+0x7e/0x130
[  685.515300]  __x64_sys_openat+0x44/0x70
[  685.515304]  do_syscall_64+0x38/0x50
[  685.515306]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  685.515307] RIP: 0033:0x7ffff77b2422
[  685.515308] Code: 00 41 00 74 63 64 8b 04 25 18 00 00 00 85 c0 0f 85 83 00 00 00 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 c5 fc 77 0f 05 <48> 3d 00 f0 ff ff 0f 87 aa 00 00 00 48 8b 4c 24 28 66 66 2e 0f 1f
[  685.515309] RSP: 002b:00007fffffffb4d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  685.515310] RAX: ffffffffffffffda RBX: 00007fffffffb6b0 RCX: 00007ffff77b2422
[  685.515311] RDX: 0000000000080002 RSI: 00007fffffffb570 RDI: 00000000ffffff9c
[  685.515311] RBP: 00007fffffffb570 R08: 00007fffffffb590 R09: 000000000000000e
[  685.515312] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000080002
[  685.515312] R13: 0000000000000001 R14: 00007ffff668ad8f R15: 0000000000000000
[  685.515313] ---[ end trace 625cb3d6336878de ]---

in dmesg and

$ sudo cat /sys/kernel/debug/dri/0/clients
             command   pid dev master a   uid      magic
(...)
             eglinfo 35290   0   y    y  1000          0

until eglinfo exits. Something that would actually make use of DRM master like https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c seems to be rejected later on

using card '/dev/dri/card0'
ignoring unused connector 86
ignoring unused connector 89
mode for connector 91 is 3840x2160
ignoring unused connector 94
ignoring unused connector 97
cannot set CRTC for connector 91 (22): Invalid argument
exiting

thankfully. But I have no idea if it is truly harmless or if it somehow messes with the internal state of the driver or Xorg. Sometimes, without the aforementioned workarounds, modesetting would randomly manage to grab an NVIDIA GPU. The logs would then show

(WW) NVIDIA: No DRM device: Direct render devices found but access was
(WW) NVIDIA:     denied.

and Xorg would segfault shortly afterwards.
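
To spot which process is currently flagged as DRM master without reading the debugfs table by eye, here is a small Python sketch that parses the clients output shown above. It assumes the seven-column layout from this thread (command, pid, dev, master, a, uid, magic); other kernel versions may print different columns, in which case the lines are simply skipped. The function name master_clients is made up for illustration.

```python
def master_clients(clients_text):
    """Return (command, pid) for every client flagged as DRM master."""
    masters = []
    for line in clients_text.splitlines():
        # Split from the right: a command name may itself contain spaces.
        parts = line.rsplit(None, 6)
        if len(parts) != 7 or not parts[1].isdigit():
            continue  # header, blank line, or unexpected layout
        command, pid, _dev, master, _auth, _uid, _magic = parts
        if master == "y":
            masters.append((command.strip(), int(pid)))
    return masters
```

Feed it the contents of /sys/kernel/debug/dri/0/clients (reading that file needs root). For the eglinfo case above it would report eglinfo with pid 35290 as master until the process exits.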

Thanks @osmoticum for the reply. I tried adding these modules to /etc/modules-load.d/nvidia.conf but that didn’t change the kernel oops. It’s not a big deal as it doesn’t appear to have any undesired side effects (other than the abrt alert). Sorry, I don’t know enough about EGL or DRM to be of any use in helping you debug the root cause of this issue.

Recently I noticed this oops only happens when I reboot. The first boot in the morning, after the machine was powered off overnight, doesn’t trigger it. Also, the reason I enable modeset is that my monitor remains black (no signal) when rebooting (only in Linux, not Windows), which of course makes it hard to get any work done. There appears to be some difference in behavior between a cold boot and a reboot, probably some interaction with UEFI setting the graphics mode, since this only occurs when plymouth is running.