Can you share a little more about what you changed?
To load the kernel modules early with systemd, I added a conf file containing
nvidia
nvidia_modeset
nvidia_drm
to modules-load.d. That prevented the modesetting DDX from trying to use the NVIDIA GPUs, and it kept Xorg itself from triggering the oops for me.
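For reference, a minimal sketch of creating such a file. The file name nvidia.conf and the helper function are my own choices for illustration, not something the driver or systemd requires; /etc/modules-load.d is the usual location read by systemd-modules-load.service.

```shell
# Sketch only: write the three module names to a modules-load.d drop-in.
# "nvidia.conf" and the optional directory argument are assumptions made
# for illustration; any *.conf name in the directory would do.
write_nvidia_modules_conf() {
    dir="${1:-/etc/modules-load.d}"
    # One module name per line, in the order listed above.
    printf '%s\n' nvidia nvidia_modeset nvidia_drm > "$dir/nvidia.conf"
}
```

Called as root without an argument, it writes /etc/modules-load.d/nvidia.conf; passing a directory lets you inspect the result somewhere harmless first.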
Do you ever see the oops related to drm_new_set_master after the changes you made?
Whenever something pokes at an NVIDIA DRM primary node while Xorg is running with the NVIDIA DDX, such as drmdevice, as previously mentioned. eglinfo (from https://archive.mesa3d.org/demos/mesa-demos-8.4.0.tar.bz2) is another example:
$ udevadm info -a /dev/dri/card0 | grep nvidia
DRIVERS=="nvidia"
Breakpoint 1, __libc_open64 (file=0x7fffffffb560 "/dev/dri/card0", oflag=524290) at ../sysdeps/unix/sysv/linux/open64.c:37
37 in ../sysdeps/unix/sysv/linux/open64.c
(gdb) bt
#0 __libc_open64 (file=0x7fffffffb560 "/dev/dri/card0", oflag=524290) at ../sysdeps/unix/sysv/linux/open64.c:37
#1 0x00007ffff66395e9 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#2 0x00007ffff666a84a in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#3 0x00007ffff667c7ca in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#4 0x00007ffff6677ae8 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#5 0x00007ffff6677190 in ?? () from /opt/nvidia/lib64/libnvidia-glsi.so.460.39
#6 0x00007ffff6937d6a in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#7 0x00007ffff6927140 in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#8 0x00007ffff692742d in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#9 0x00007ffff693a561 in ?? () from /opt/nvidia/lib64/libEGL_nvidia.so.0
#10 0x00005555555557e9 in doOneDisplay (d=0x5555555af320, name=name@entry=0x5555555561f8 "Device platform") at eglinfo.c:185
#11 0x0000555555555a10 in main (argc=<optimized out>, argv=<optimized out>) at eglinfo.c:238
(gdb) print (int) getpid()
$1 = 35290
(gdb) continue
Continuing.
results in
[ 685.515248] CPU: 0 PID: 35290 Comm: eglinfo Tainted: G S OE 5.10.14 #1
(...)
[ 685.515251] RIP: 0010:nv_drm_master_set+0x22/0x30 [nvidia_drm]
[ 685.515253] Code: 0f 1f 84 00 00 00 00 00 55 48 8b 47 48 48 89 e5 48 8b 78 20 48 8b 05 cd 5b 00 00 48 8b 40 28 e8 04 56 ca c8 84 c0 74 02 5d c3 <0f> 0b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 80 3d f1 a4 00 00 00 55
[ 685.515254] RSP: 0018:ffffb08f0dc0fb90 EFLAGS: 00010246
[ 685.515255] RAX: 0000000000000000 RBX: ffff95057719f400 RCX: 0000000000000008
[ 685.515255] RDX: ffffffffc251ce18 RSI: 0000000000000296 RDI: ffffffffc251ce10
[ 685.515256] RBP: ffffb08f0dc0fb90 R08: 0000000000000008 R09: ffffb08f0dc0fb78
[ 685.515256] R10: 0000000000000000 R11: ffff9509b968fd9a R12: ffff9505ecc3cf00
[ 685.515257] R13: ffff9502befd8800 R14: 0000000000000000 R15: ffff9502befd8800
[ 685.515258] FS: 00007ffff6c3bb80(0000) GS:ffff950a5f800000(0000) knlGS:0000000000000000
[ 685.515259] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 685.515259] CR2: 00007f34d30e4000 CR3: 0000000ef1fa6005 CR4: 00000000003706f0
[ 685.515260] Call Trace:
[ 685.515267] drm_new_set_master+0x79/0x100
[ 685.515268] drm_master_open+0x69/0x90
[ 685.515269] drm_open+0xf7/0x2a0
[ 685.515274] ? radix_tree_lookup+0xd/0x10
[ 685.515276] drm_stub_open+0xb5/0x130
[ 685.515281] chrdev_open+0xae/0x200
[ 685.515282] ? cdev_device_add+0x90/0x90
[ 685.515285] do_dentry_open+0x155/0x370
[ 685.515287] vfs_open+0x28/0x30
[ 685.515289] do_open+0x225/0x310
[ 685.515290] path_openat+0xdb/0x1a0
[ 685.515291] do_filp_open+0x78/0x100
[ 685.515292] ? __check_object_size+0x17/0x20
[ 685.515295] ? strncpy_from_user+0x8c/0x1a0
[ 685.515297] ? __alloc_fd+0x3a/0x150
[ 685.515298] do_sys_openat2+0x7e/0x130
[ 685.515300] __x64_sys_openat+0x44/0x70
[ 685.515304] do_syscall_64+0x38/0x50
[ 685.515306] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 685.515307] RIP: 0033:0x7ffff77b2422
[ 685.515308] Code: 00 41 00 74 63 64 8b 04 25 18 00 00 00 85 c0 0f 85 83 00 00 00 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 c5 fc 77 0f 05 <48> 3d 00 f0 ff ff 0f 87 aa 00 00 00 48 8b 4c 24 28 66 66 2e 0f 1f
[ 685.515309] RSP: 002b:00007fffffffb4d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 685.515310] RAX: ffffffffffffffda RBX: 00007fffffffb6b0 RCX: 00007ffff77b2422
[ 685.515311] RDX: 0000000000080002 RSI: 00007fffffffb570 RDI: 00000000ffffff9c
[ 685.515311] RBP: 00007fffffffb570 R08: 00007fffffffb590 R09: 000000000000000e
[ 685.515312] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000080002
[ 685.515312] R13: 0000000000000001 R14: 00007ffff668ad8f R15: 0000000000000000
[ 685.515313] ---[ end trace 625cb3d6336878de ]---
in dmesg and
$ sudo cat /sys/kernel/debug/dri/0/clients
command pid dev master a uid magic
(...)
eglinfo 35290 0 y y 1000 0
until eglinfo exits. Something that would actually make use of DRM master, like https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c, seems to be rejected later on
using card '/dev/dri/card0'
ignoring unused connector 86
ignoring unused connector 89
mode for connector 91 is 3840x2160
ignoring unused connector 94
ignoring unused connector 97
cannot set CRTC for connector 91 (22): Invalid argument
exiting
thankfully. But I have no idea whether it is truly harmless or whether it somehow messes with the internal state of the driver or Xorg. Sometimes, without the aforementioned workarounds, modesetting would randomly manage to grab an NVIDIA GPU; the logs would show
(WW) NVIDIA: No DRM device: Direct render devices found but access was
(WW) NVIDIA: denied.
and Xorg would segfault shortly afterwards.
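In case it helps anyone reproduce this, here is a rough sketch for spotting which clients hold DRM master in the debugfs clients file shown earlier. It assumes the column layout from my 5.10 output (command, pid, dev, master, a, uid, magic) and command names without spaces; other kernel versions may lay the file out differently. The function name drm_masters is made up.

```shell
# Sketch only: read a debugfs "clients" listing on stdin and print the
# command and pid of every client whose "master" column is "y".
# Assumes the 5.10-era layout: command pid dev master a uid magic.
drm_masters() {
    # NR > 1 skips the header line; field 4 is the master flag.
    awk 'NR > 1 && $4 == "y" { print $1, $2 }'
}
```

Usage would be something like `sudo cat /sys/kernel/debug/dri/0/clients | drm_masters`, run before and after starting the program under test.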