Bug: driver crash on Linux kernel > 5.4 when using two GPUs

nvidia-bug-report.log.gz (386.1 KB)

I’m on Gentoo, running:

  • GeForce GTX 960 in slot 1
  • GeForce GTX 950 in slot 2

I have two cards because I use one for VFIO/GPU passthrough (using the ACS override patch).

Things work normally on Linux kernel 5.4.60 using nvidia drivers 460.91.03-r2.

However, over the past several months I have tried upgrading both the kernel and the drivers, with no luck. Any combination of the following has triggered a crash at start-up:

  • Kernels 5.10, 5.12, 5.15
  • Nvidia drivers 460.x, 470.x, 495.x
  • ACS override patch on or off

Here’s a snippet of the boot log running kernel 5.15, driver 470.x, ACS override disabled:

Jan 01 16:33:42 [kernel] [   20.958863] nvidia: loading out-of-tree module taints kernel.
                - Last output repeated twice -
Jan 01 16:33:42 [kernel] [   20.958871] nvidia: module license 'NVIDIA' taints kernel.
Jan 01 16:33:42 [kernel] [   20.958937] nvidia: module license 'NVIDIA' taints kernel.
Jan 01 16:33:42 [kernel] [   20.958995] Disabling lock debugging due to kernel taint
Jan 01 16:33:42 [kernel] [   20.972276] nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jan 01 16:33:42 [kernel] [   20.983866] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Jan 01 16:33:42 [kernel] [   20.983944] 
Jan 01 16:33:42 [kernel] [   20.984656] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Jan 01 16:33:42 [kernel] [   21.184278] NVRM: The NVIDIA probe routine was not called for 1 device(s).
Jan 01 16:33:42 [kernel] [   21.184376] NVRM: This can occur when a driver such as: 
Jan 01 16:33:42 [kernel] [   21.184376] NVRM: nouveau, rivafb, nvidiafb or rivatv 
Jan 01 16:33:42 [kernel] [   21.184376] NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Jan 01 16:33:42 [kernel] [   21.184470] NVRM: Try unloading the conflicting kernel module (and/or
Jan 01 16:33:42 [kernel] [   21.184470] NVRM: reconfigure your kernel without the conflicting
Jan 01 16:33:42 [kernel] [   21.184470] NVRM: driver(s)), then try loading the NVIDIA kernel module
Jan 01 16:33:42 [kernel] [   21.184470] NVRM: again.
Jan 01 16:33:42 [kernel] [   21.184583] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.46  Wed Oct 27 16:31:33 UTC 2021
Jan 01 16:33:42 [kernel] [   21.339294] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  495.46  Wed Oct 27 16:22:48 UTC 2021
Jan 01 16:33:42 [kernel] [   21.409697] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jan 01 16:33:42 [kernel] [   21.410151] pmd_set_huge: Cannot satisfy [mem 0xf6000000-0xf6200000] with a huge-page mapping due to MTRR override.
Jan 01 16:33:42 [kernel] [   21.424758] resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000e0000-0x000e3fff window]
Jan 01 16:33:42 [kernel] [   21.424836] caller _nv032275rm+0x2a/0x60 [nvidia] mapping multiple BARs
Jan 01 16:33:42 [kernel] [   21.577405] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
Jan 01 16:33:42 [kernel] [   21.577482] caller _nv000717rm+0x1ad/0x200 [nvidia] mapping multiple BARs
Jan 01 16:33:42 [kernel] [   22.293655] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Jan 01 16:33:42 [kernel] [   32.709826] ext2 filesystem being mounted at /boot supports timestamps until 2038 (0x7fffffff)
Jan 01 16:33:42 [kernel] [   32.794307] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Jan 01 16:33:42 [kernel] [   32.867490] SGI XFS with ACLs, security attributes, quota, no debug enabled
Jan 01 16:33:42 [kernel] [   32.868397] XFS (dm-6): Mounting V5 Filesystem
Jan 01 16:33:42 [kernel] [   33.088618] XFS (dm-6): Ending clean mount
Jan 01 16:33:42 [kernel] [   33.089936] xfs filesystem being mounted at /data1 supports timestamps until 2038 (0x7fffffff)
Jan 01 16:33:42 [kernel] [   33.163333] EXT4-fs (dm-8): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Jan 01 16:33:42 [kernel] [   33.200724] EXT4-fs (dm-3): re-mounted. Opts: . Quota mode: none.
Jan 01 16:33:42 [kernel] [   33.381467] Adding 10485756k swap on /dev/mapper/vg1-swap.  Priority:-2 extents:1 across:10485756k 
Jan 01 16:33:43 [kernel] [   40.202600] #PF: error_code(0x0000) - not-present page
Jan 01 16:33:43 [kernel] [   40.202666] PGD 0 P4D 0 
Jan 01 16:33:43 [kernel] [   40.202732] Oops: 0000 [#1] SMP PTI
Jan 01 16:33:43 [kernel] [   40.202797] CPU: 5 PID: 3726 Comm: X Tainted: P           OE     5.15.11-gentoo-x86_64 #1
Jan 01 16:33:43 [kernel] [   40.202873] Hardware name: MSI MS-7922/ Z97S SLI Krait Edition (MS-7922), BIOS V10.7 02/16/2016
Jan 01 16:33:43 [kernel] [   40.202948] RIP: 0010:nv_audio_dynamic_power+0xb4/0x120 [nvidia]
Jan 01 16:33:43 [kernel] [   40.203154] Code: 01 00 00 48 85 d2 74 9a 48 8b 82 a8 01 00 00 48 81 c2 a0 01 00 00 48 39 d0 75 0f eb 85 48 8b 40 08 48 39 d0 0f 84 78 ff ff ff <83> 78 1c 03 75 ed 48 8b 78 20 48 8b 87 30 03 00 00 48 85 ff 0f 84
Jan 01 16:33:43 [kernel] [   40.203243] RSP: 0018:ffffc900021c7490 EFLAGS: 00010207
Jan 01 16:33:43 [kernel] [   40.203309] RAX: 0000000000000000 RBX: ffff88811125f080 RCX: 0000000000000002
Jan 01 16:33:43 [kernel] [   40.203377] RDX: ffff888106adfda0 RSI: 0000000000000000 RDI: ffff888101557108
Jan 01 16:33:43 [kernel] [   40.203444] RBP: ffff88811125ef70 R08: ffff888101271ca0 R09: 0000000000000000
Jan 01 16:33:43 [kernel] [   40.203511] R10: ffffffffa125d2a0 R11: ffffc9000015d008 R12: ffff88811125efb8
Jan 01 16:33:43 [kernel] [   40.203579] R13: ffffffffa332b180 R14: ffff88810afc4020 R15: 0000000000000000
Jan 01 16:33:43 [kernel] [   40.203645] FS:  00007fee4397f8c0(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000
Jan 01 16:33:43 [kernel] [   40.203719] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 01 16:33:43 [kernel] [   40.203785] CR2: 000000000000001c CR3: 0000000109bb8006 CR4: 00000000001706e0
Jan 01 16:33:43 [kernel] [   40.203853] Call Trace:
Jan 01 16:33:43 [kernel] [   40.203918]  <TASK>
Jan 01 16:33:43 [kernel] [   40.203982]  ? _nv035225rm+0x3b/0x150 [nvidia]
Jan 01 16:33:43 [kernel] [   40.204177]  _nv037976rm+0x25/0x30 [nvidia]
Jan 01 16:33:43 [kernel] [   40.204396]  ? _nv000790rm+0x70/0x70 [nvidia]
Jan 01 16:33:43 [kernel] [   40.204610]  ? _nv034973rm+0x18c/0x1a0 [nvidia]
Jan 01 16:33:43 [kernel] [   40.204860]  ? _nv036712rm+0x265/0x2c0 [nvidia]
Jan 01 16:33:43 [kernel] [   40.205112]  ? _nv014671rm+0x76e/0x920 [nvidia]
Jan 01 16:33:43 [kernel] [   40.205308]  ? _nv035105rm+0x53/0x170 [nvidia]
Jan 01 16:33:43 [kernel] [   40.205492]  ? _nv019170rm+0x842/0xc90 [nvidia]
Jan 01 16:33:43 [kernel] [   40.205676]  ? _nv019170rm+0xc86/0xc90 [nvidia]
Jan 01 16:33:43 [kernel] [   40.205861]  ? _nv019170rm+0xc6f/0xc90 [nvidia]
Jan 01 16:33:43 [kernel] [   40.206045]  ? rm_kernel_rmapi_op+0x141/0x190 [nvidia]
Jan 01 16:33:43 [kernel] [   40.206263]  ? nvkms_call_rm+0x4b/0x80 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.206340]  ? _nv002514kms+0x51/0x60 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.206417]  ? _raw_spin_lock_irqsave+0x32/0x50
Jan 01 16:33:43 [kernel] [   40.206486]  ? _nv002155kms+0x5cf/0x9e0 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.206561]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.206627]  ? _raw_spin_unlock_irqrestore+0x16/0x20
Jan 01 16:33:43 [kernel] [   40.206694]  ? nv_init_msi+0xcc/0xf0 [nvidia]
Jan 01 16:33:43 [kernel] [   40.206836]  ? _nv002376kms+0x12b/0x510 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.206912]  ? nvkms_call_rm+0x5b/0x80 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.206983]  ? _nv002514kms+0x51/0x60 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207060]  ? _nv002607kms+0x20b/0xce0 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207137]  ? _nv002569kms+0x2796/0x2e40 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207213]  ? finish_task_switch.isra.0+0xa8/0x260
Jan 01 16:33:43 [kernel] [   40.207281]  ? get_page_from_freelist+0xc5/0x3b0
Jan 01 16:33:43 [kernel] [   40.207348]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.207414]  ? _nv002584kms+0x19a/0x760 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207490]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.207555]  ? nv_kthread_q_stop+0x2240/0x2d30 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207628]  ? _nv002333kms+0x11a/0x230 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207701]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.207767]  ? kfree+0xb3/0x160
Jan 01 16:33:43 [kernel] [   40.207833]  ? _nv002054kms+0x18b3/0x2710 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.207912]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.207977]  ? _nv002054kms+0x18b3/0x2710 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208056]  ? _nv002306kms+0x426/0x1330 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208149]  ? trace_hardirqs_on+0x2b/0xb0
Jan 01 16:33:43 [kernel] [   40.208237]  ? kfree+0xb3/0x160
Jan 01 16:33:43 [kernel] [   40.208322]  ? nv_kthread_q_stop+0x2188/0x2d30 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208394]  ? nv_kthread_q_stop+0x2216/0x2d30 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208467]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208539]  ? nvkms_ioctl_from_kapi+0x4c/0x90 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208611]  ? _nv002054kms+0x36c/0x2710 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.208689]  ? nv_drm_exit+0xde/0x350 [nvidia_drm]
Jan 01 16:33:43 [kernel] [   40.208756]  ? _nv002054kms+0x330/0x2710 [nvidia_modeset]
Jan 01 16:33:43 [kernel] [   40.209070]  ? __fput+0x94/0x250
Jan 01 16:33:43 [kernel] [   40.214960]  ? task_work_run+0x61/0x90
Jan 01 16:33:43 [kernel] [   40.215026]  ? exit_to_user_mode_loop+0x133/0x140
Jan 01 16:33:43 [kernel] [   40.215093]  ? exit_to_user_mode_prepare+0x8d/0xa0
Jan 01 16:33:43 [kernel] [   40.215159]  ? syscall_exit_to_user_mode+0x27/0x50
Jan 01 16:33:43 [kernel] [   40.215245]  ? do_syscall_64+0x48/0xc0
Jan 01 16:33:43 [kernel] [   40.215310]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 01 16:33:43 [kernel] [   40.215378]  </TASK>
Jan 01 16:33:43 [kernel] [   40.215442] Modules linked in: xfs ext2 nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) x86_pkg_temp_thermal coretemp drm_kms_helper kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio drm snd_hda_intel syscopyarea ppdev sysfillrect mxm_wmi snd_intel_dspcfg at24 regmap_i2c snd_hda_codec iTCO_wdt iTCO_vendor_support i2c_i801 snd_hda_core ghash_clmulni_intel sysimgblt lpc_ich i2c_smbus fb_sys_fops snd_hwdep i2c_core mfd_core pcspkr snd_pcm parport_pc parport wmi video btrfs blake2b_generic xor zstd_compress hci_vhci ppp_generic slhc bluetooth vhost_net snd_seq tun snd_seq_device snd_timer cuse vhost ecdh_generic vhost_iotlb fuse tap ecc snd raid6_pq autofs4 soundcore rfkill nvram ext4 mbcache jbd2 dm_crypt encrypted_keys sd_mod t10_pi xhci_pci crc32c_intel ahci r8169 libahci xhci_hcd realtek
Jan 01 16:33:43 [kernel] [   40.215646] CR2: 000000000000001c
Jan 01 16:33:43 [kernel] [   40.215713] ---[ end trace b49938411775b3a9 ]---
Jan 01 16:33:43 [kernel] [   40.215797] RIP: 0010:nv_audio_dynamic_power+0xb4/0x120 [nvidia]
Jan 01 16:33:43 [kernel] [   40.215937] Code: 01 00 00 48 85 d2 74 9a 48 8b 82 a8 01 00 00 48 81 c2 a0 01 00 00 48 39 d0 75 0f eb 85 48 8b 40 08 48 39 d0 0f 84 78 ff ff ff <83> 78 1c 03 75 ed 48 8b 78 20 48 8b 87 30 03 00 00 48 85 ff 0f 84
Jan 01 16:33:43 [kernel] [   40.216026] RSP: 0018:ffffc900021c7490 EFLAGS: 00010207
Jan 01 16:33:43 [kernel] [   40.216092] RAX: 0000000000000000 RBX: ffff88811125f080 RCX: 0000000000000002
Jan 01 16:33:43 [kernel] [   40.216159] RDX: ffff888106adfda0 RSI: 0000000000000000 RDI: ffff888101557108
Jan 01 16:33:43 [kernel] [   40.216245] RBP: ffff88811125ef70 R08: ffff888101271ca0 R09: 0000000000000000
Jan 01 16:33:43 [kernel] [   40.216331] R10: ffffffffa125d2a0 R11: ffffc9000015d008 R12: ffff88811125efb8
Jan 01 16:33:43 [kernel] [   40.216399] R13: ffffffffa332b180 R14: ffff88810afc4020 R15: 0000000000000000
Jan 01 16:33:43 [kernel] [   40.216466] FS:  00007fee4397f8c0(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000
Jan 01 16:33:43 [kernel] [   40.216541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 01 16:33:43 [kernel] [   40.216607] CR2: 000000000000001c CR3: 0000000109bb8006 CR4: 00000000001706e0

Have an update on this.

One clue that helped me figure this out was this line:

Jan 01 16:33:43 [kernel] [   40.202948] RIP: 0010:nv_audio_dynamic_power+0xb4/0x120 [nvidia]

This implies there’s something wrong with the audio configuration.
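To make that reading of the oops a bit more mechanical, here is a minimal Python sketch that pulls the faulting symbol, its module, and the fault address (CR2) out of an oops excerpt. The function name `summarize_oops` and the "below one 4 KiB page means near-null" rule of thumb are my own assumptions for illustration, not part of any kernel tooling; the sample lines are trimmed from the log above.

```python
import re

# Trimmed excerpts from the oops above (timestamps removed).
OOPS_EXCERPT = """\
RIP: 0010:nv_audio_dynamic_power+0xb4/0x120 [nvidia]
CR2: 000000000000001c CR3: 0000000109bb8006 CR4: 00000000001706e0
"""

# "RIP: <segment>:<symbol>+<offset>/<size> [<module>]"
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)
# CR2 holds the address whose access caused the page fault.
CR2_RE = re.compile(r"CR2: (?P<addr>[0-9a-f]+)")


def summarize_oops(text):
    """Return the faulting symbol, module and fault address, or None."""
    rip = RIP_RE.search(text)
    cr2 = CR2_RE.search(text)
    if rip is None or cr2 is None:
        return None
    fault_addr = int(cr2.group("addr"), 16)
    return {
        "symbol": rip.group("symbol"),
        "module": rip.group("module"),
        "fault_addr": fault_addr,
        # Assumption: an address inside the first 4 KiB page is treated
        # as a NULL pointer plus a small field offset.
        "near_null": fault_addr < 4096,
    }


summary = summarize_oops(OOPS_EXCERPT)
print(summary)
```

For this log the fault address is 0x1c with RAX at zero, which is consistent with the driver following a NULL pointer while looking for the GPU's audio function.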

As I’m using VFIO I figured there must be a problem with the way I was configuring it.

One of the steps involved in configuring VFIO is passing vfio-pci.ids=... as a kernel boot argument.

Since I have two GPUs, the 960 and the 950, and only needed to pass through the 950, the argument was:

vfio-pci.ids=10de:1402,10de:0fba

However, listing the PCI devices shows this:

$ lspci -nn | grep -i nvi      
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GM206 High Definition Audio Controller [10de:0fba] (rev a1)
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 950] [10de:1402] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation GM206 High Definition Audio Controller [10de:0fba] (rev a1)

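Notice that the audio devices for both GPUs have the same ID! Because vfio-pci.ids matches devices by vendor:device ID rather than by slot, 10de:0fba also claimed the 960’s audio function (01:00.1), which the host’s nvidia driver evidently expected to be available.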
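The effect of the ID list can be sketched in a few lines of Python. This is only an illustration of the matching idea, assuming plain vendor:device pairs (the real vfio-pci.ids parameter also accepts optional subvendor, subdevice and class fields, which this ignores); `parse_lspci` and `claimed_by_ids` are hypothetical helper names.

```python
import re

# The lspci -nn output from above.
LSPCI_OUTPUT = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GM206 High Definition Audio Controller [10de:0fba] (rev a1)
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 950] [10de:1402] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation GM206 High Definition Audio Controller [10de:0fba] (rev a1)
"""

# Slot address at the start of the line, vendor:device ID in the last
# bracket of the form [xxxx:xxxx].
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) .*"
    r"\[(?P<id>[0-9a-f]{4}:[0-9a-f]{4})\]"
)


def parse_lspci(text):
    """Return a list of (slot address, vendor:device ID) tuples."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            devices.append((match.group("addr"), match.group("id")))
    return devices


def claimed_by_ids(devices, ids_arg):
    """Slots whose ID appears in a comma-separated vendor:device list."""
    wanted = set(ids_arg.split(","))
    return [addr for addr, dev_id in devices if dev_id in wanted]


devices = parse_lspci(LSPCI_OUTPUT)
print(claimed_by_ids(devices, "10de:1402,10de:0fba"))
print(claimed_by_ids(devices, "10de:1402"))
```

With the original argument the match includes 01:00.1, the audio function of the card that is supposed to stay on the host; with only 10de:1402 the match is limited to 02:00.0.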

Removing the audio ID (10de:0fba) from the argument allowed it to work :) (kernel 5.15, driver 510)

There is still a way to get the 950’s audio device passed in, and I’ll post an update here if/when I get it working, but for now, at least my host can actually boot!
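For reference, one approach I have not tested on this machine is to select the device by PCI address instead of by ID, using the kernel's per-device `driver_override` sysfs attribute. The sketch below is an assumption about how that could look, not a confirmed fix: `set_driver_override` is a hypothetical helper, the address `0000:02:00.1` is the 950's audio function from the lspci output with the default domain `0000` assumed, and the write would have to happen as root before another driver binds to the device (or after unbinding it).

```python
from pathlib import Path


def set_driver_override(pci_addr, driver="vfio-pci", sysfs_root="/sys"):
    """Ask the kernel to bind only `driver` to the device at `pci_addr`.

    pci_addr is the full address including the domain, e.g. "0000:02:00.1".
    sysfs_root is a parameter only so the function can be tried against
    a scratch directory instead of the real /sys.
    """
    override = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "driver_override"
    )
    override.write_text(driver + "\n")
    return override


if __name__ == "__main__":
    # Hypothetical usage on the real system (needs root):
    # only the 950's audio function is targeted, so the 960's audio
    # function at 0000:01:00.1 stays with the host.
    set_driver_override("0000:02:00.1")
```

Because the device is picked by address, the shared 10de:0fba ID no longer matters.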