Kernel Oops on boot, 5.9.13 with 455.45.01

Hi

It seems I’ve been getting an oops on boot for a while without noticing. It doesn’t cause any problem that I can see.

Here’s the oops:

[   20.673580] RIP: 0010:nv_drm_master_set+0x22/0x30 [nvidia_drm]
[   20.673582] Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 47 48 48 8b 78 20 48 8b 05 0c 5d 00 00 48 8b 40 28 e8 e3 07 77 f6 84 c0 74 01 c3 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 8c
[   20.673583] RSP: 0018:ffffa45104f53c00 EFLAGS: 00010246
[   20.673584] RAX: 0000000000000000 RBX: ffff8f182737ba00 RCX: 0000000000000008
[   20.673585] RDX: ffffffffc264eed8 RSI: 0000000000000292 RDI: 0000000000000292
[   20.673586] RBP: ffff8f1875f2ecc0 R08: 0000000000000008 R09: ffffa45104f53be8
[   20.673586] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f188ab63000
[   20.673587] R13: 0000000000000000 R14: ffff8f188ab63000 R15: 000000009c416ba8
[   20.673588] FS:  00007f72082dcb80(0000) GS:ffff8f189ed00000(0000) knlGS:0000000000000000
[   20.673589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.673590] CR2: 000055bdfd713e08 CR3: 000000038f0b2006 CR4: 00000000001706e0
[   20.673591] Call Trace:
[   20.673609]  drm_new_set_master+0x7a/0x100 [drm]
[   20.673622]  drm_master_open+0x68/0x90 [drm]
[   20.673632]  drm_open+0xf8/0x250 [drm]
[   20.673645]  drm_stub_open+0xab/0x130 [drm]
[   20.673649]  chrdev_open+0xdd/0x210
[   20.673651]  ? cdev_device_add+0x90/0x90
[   20.673653]  do_dentry_open+0x14b/0x360
[   20.673656]  path_openat+0xa70/0xfb0
[   20.673659]  ? vsnprintf+0x387/0x4e0
[   20.673661]  ? page_counter_uncharge+0x36/0x50
[   20.673663]  do_filp_open+0x75/0x100
[   20.673665]  ? __check_object_size+0x136/0x150
[   20.673667]  ? __alloc_fd+0x44/0x150
[   20.673669]  do_sys_openat2+0x7b/0x130
[   20.673671]  __x64_sys_openat+0x46/0x70
[   20.673673]  do_syscall_64+0x33/0x40
[   20.673676]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And here’s the bug report:
nvidia-bug-report.log.gz (1.2 MB)

Regards
elFarto

Same issue with 5.9.15 and 460.27.04.

I’m seeing the same kernel oops on Fedora 33 with 5.9.16 and 460.27.04. It has been happening for a while, but only when I enable options nvidia-drm modeset=1. Without that option my monitor has trouble syncing after a reboot (the display stays blank). Since the oops doesn’t appear to cause any issues, as @elFarto mentioned, I’ve left the module option enabled to avoid the display problem. In my setup modesetting doesn’t appear to be needed for a clean graphical boot on first boot, because UEFI sets the resolution at boot and GRUB leaves it alone.
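For anyone comparing setups, here is a minimal sketch for checking whether the modeset option is actually in effect on a running system. It assumes the parameter is exposed at /sys/module/nvidia_drm/parameters/modeset and reads as Y or N; the path and the helper name modeset_enabled are my assumptions, not something from the driver documentation, and the file may only be readable by root.

```python
from pathlib import Path

# Assumed sysfs location of the nvidia-drm "modeset" parameter.
MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")


def modeset_enabled(param_path=MODESET_PARAM):
    """Return True/False for the parameter value, or None if it cannot be read."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, or the file is not readable by this user.
        return None
    # Boolean module parameters are normally shown as "Y" or "N".
    return value in ("Y", "1")
```

Calling modeset_enabled() with no arguments checks the assumed default path; a result of None means the value could not be read, not that modesetting is off.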

The Xorg modesetting DDX seems to be fighting with the Nvidia DDX for DRM nodes:

[   463.657] (II) modeset(G0): using drv /dev/dri/card0

but

[   463.658] (II) Applying OutputClass "nvidia" options to /dev/dri/card0

Nvidia appears to be adding a lot of DRM-related functionality to the kernel driver (so it now looks much like a real DRM driver to userspace), and as a result the modesetting DDX gets further into its initialization.

In my case, two changes worked around this: pointing the modesetting DDX at non-Nvidia devices only (the kmsdev option), and ensuring all the Nvidia kernel modules are fully loaded (instead of loading on demand) before starting X. The latter avoids a race condition where the Nvidia driver is only partially loaded and the code at https://gitlab.freedesktop.org/xorg/xserver/-/blob/master/config/udev.c#L135 fails, which causes modesetting to load regardless (it’s hardcoded in xf86platformAddDevice).
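The "fully loaded before starting X" part can be checked with something like the sketch below, which parses the contents of /proc/modules and reports any Nvidia module that is absent or not yet in the Live state. The list of required modules and the helper name missing_modules are my own assumptions for illustration; adjust the list to match your system.

```python
# Module names as they appear in /proc/modules. Which modules count as
# "all" of the Nvidia stack is an assumption here (nvidia_uvm is left out).
REQUIRED_MODULES = ("nvidia", "nvidia_modeset", "nvidia_drm")


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not listed in the 'Live' state."""
    live = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Line format: name size refcount deps state address [taint flags]
        if len(fields) >= 5 and fields[4] == "Live":
            live.add(fields[0])
    return [name for name in required if name not in live]
```

A pre-start script could read /proc/modules, pass the text to missing_modules, and hold off starting X until the returned list is empty. Modules still in the Loading state are deliberately treated as missing, since the race described above involves a partially loaded driver.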

The WARN_ON itself can still be triggered simply by opening an Nvidia primary DRM node, e.g. with https://gitlab.freedesktop.org/mesa/drm/-/blob/master/tests/drmdevice.c, when some other DRM client like X or fbcon already has master. Because of https://patchwork.freedesktop.org/patch/367748 the Nvidia kernel driver can no longer use its internal grabOwnership to report whether DRM master can be taken.
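A rough way to try this without building the libdrm test is a plain open() on the primary node, which is the entry point shown in the call trace (openat, then drm_open, then drm_master_open). The sketch below is only that; it is not equivalent to drmdevice.c, the helper name try_open_drm_node is made up, and /dev/dri/card0 is taken from the Xorg log above, so it may not be the Nvidia node on another machine. Check dmesg afterwards to see whether the warning fired.

```python
import errno
import os


def try_open_drm_node(path="/dev/dri/card0"):
    """Open and immediately close a DRM node.

    Returns None on success, or the errno name (e.g. "EACCES") on failure.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return None
```

A return value of None only means the open succeeded; whether the WARN_ON triggered has to be confirmed from the kernel log.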
