Kernel null pointer dereference in nvidia_modeset during Thunderbolt dock disconnect

Bug Report: Kernel NULL pointer dereference in nvidia_modeset during Thunderbolt dock disconnect (multiple monitors only)

Summary

Disconnecting a Thunderbolt 4 dock causes a kernel NULL pointer dereference in nvidia_modeset during drm_atomic_commit, crashing the sway compositor and freezing the system.

The crash only occurs when 2+ external monitors are connected. With a single external monitor, hot-unplug works correctly and sway gracefully falls back to the internal display.

This suggests a bug in the atomic commit path when disabling multiple CRTCs/connectors simultaneously during hot-unplug.

Hardware

  • Laptop: Lenovo ThinkPad X1 Extreme Gen 2 (20QV001GMX)

  • GPU: NVIDIA GeForce GTX 1650 Mobile / Max-Q [10de:1f91] (Turing)

  • iGPU: Intel UHD Graphics 630

  • Dock: Lenovo ThinkPad Thunderbolt 4 Dock (40B0)

  • Thunderbolt Controller: Intel JHL7540 (Titan Ridge)

  • External Displays: Samsung S32D850 (2560x1440), Samsung S34CG50 (3440x1440) via dock DP ports

Software

  • OS: NixOS 25.11

  • Compositor: sway (wlroots-based Wayland compositor)

  • Display Configuration: Hybrid GPU setup using WLR_DRM_DEVICES=/dev/dri/igpu:/dev/dri/dgpu
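For anyone trying to reproduce the hybrid setup: the names /dev/dri/igpu and /dev/dri/dgpu above are presumably custom symlinks to the real card nodes, and which cardN belongs to which GPU can differ between boots. The following is a minimal sketch, assuming the standard sysfs layout under /sys/class/drm, that maps each card node to its PCI vendor. The function name and the vendor table are illustrative and not part of any tool mentioned in this report.

```python
from pathlib import Path

# PCI vendor IDs as they appear in sysfs (Intel and NVIDIA only).
VENDORS = {"0x8086": "intel", "0x10de": "nvidia"}


def drm_cards_by_vendor(sysfs_drm="/sys/class/drm"):
    """Return {card name: vendor label} for each primary DRM node."""
    result = {}
    for entry in sorted(Path(sysfs_drm).glob("card[0-9]*")):
        # Skip connector entries such as card0-eDP-1.
        if "-" in entry.name:
            continue
        vendor_file = entry / "device" / "vendor"
        if not vendor_file.is_file():
            continue
        vendor_id = vendor_file.read_text().strip().lower()
        # Unknown vendors are reported by their raw ID.
        result[entry.name] = VENDORS.get(vendor_id, vendor_id)
    return result


if __name__ == "__main__":
    for card, vendor in drm_cards_by_vendor().items():
        print(f"/dev/dri/{card}: {vendor}")
```

The output can be used to double-check that the order given in WLR_DRM_DEVICES really lists the Intel device first.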

Versions Tested (all exhibit the crash)

NVIDIA Drivers:

  • 580.119.02 (stable/production)

  • 590.48.01 (latest/beta)

  • Both open and closed kernel modules

Linux Kernels:

  • 6.6.x (LTS)

  • 6.12.64

  • 6.18.4 (latest)

Steps to Reproduce

  1. Boot system with Thunderbolt dock connected

  2. Connect two or more external monitors to the dock (e.g., one HDMI, one DisplayPort)

  3. Log in to the sway compositor (external monitors work correctly)

  4. Physically disconnect the Thunderbolt dock cable

  5. System freezes immediately

Does not crash when:

  • Only one external monitor is connected to the dock. In this case, hot-unplug works correctly and sway falls back to eDP-1

Behavior

Kernel NULL pointer dereference in nvidia_modeset, killing the sway process and freezing the display. System requires hard reboot.

Kernel Oops (Driver 590.48.01, Kernel 6.12.64)

BUG: kernel NULL pointer dereference, address: 0000000000000409
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 2 UID: 1000 PID: 1592 Comm: sway Tainted: P           O       6.12.64 #1-NixOS
Hardware name: LENOVO 20QV001GMX/20QV001GMX, BIOS N2OET69W (1.56 ) 12/02/2025
RIP: 0010:_nv000778kms+0x4/0x10 [nvidia_modeset]

Call Trace:
 <TASK>
 _nv001339kms+0x94/0x180 [nvidia_modeset]
 _nv001293kms+0x25f/0x520 [nvidia_modeset]
 _nv001254kms+0xb4/0x202 [nvidia_modeset]
 _nv001271kms+0x9d/0x180 [nvidia_modeset]
 _nv002592kms+0x45f/0x7d0 [nvidia_modeset]
 _nv000725kms+0x1a6/0x620 [nvidia_modeset]
 ? _nv003066kms+0x74/0x140 [nvidia_modeset]
 _nv003103kms+0x9a1/0x45e0 [nvidia_modeset]
 nvKmsIoctl+0xf7/0x270 [nvidia_modeset]
 nvkms_ioctl_from_kapi_try_pmlock+0x60/0xa0 [nvidia_modeset]
 _nv000022kms+0x33e/0xbb0 [nvidia_modeset]
 nv_drm_atomic_apply_modeset_config+0x709/0x7b0 [nvidia_drm]
 drm_atomic_check_only+0x5f3/0xa10
 drm_atomic_commit+0x69/0xe0
 drm_mode_atomic_ioctl+0xaff/0xd70
 drm_ioctl_kernel+0xad/0x100
 drm_ioctl+0x2b0/0x520
 __x64_sys_ioctl+0x91/0xd0
 do_syscall_64+0xae/0x200
 </TASK>

CR2: 0000000000000409
note: sway[1592] exited with irqs disabled

Kernel Oops (Driver 580.119.02) - Same crash, different symbol

RIP: 0010:_nv000899kms+0x4/0x10 [nvidia_modeset]

Same call trace through nv_drm_atomic_apply_modeset_config.
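Because the nvidia_modeset symbols are obfuscated and renumbered per driver build, the two RIP lines cannot be compared by name. The following is a minimal sketch of comparing them by offset, function size and module instead. The helper names are hypothetical, and matching offset and size is only a heuristic hint that the fault site is the same, not proof.

```python
import re

# Matches e.g. "RIP: 0010:_nv000778kms+0x4/0x10 [nvidia_modeset]".
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]+:(?P<symbol>\S+?)\+(?P<offset>0x[0-9a-f]+)"
    r"/(?P<size>0x[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_rip(line):
    """Extract symbol, offset, function size and module from a RIP line."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
    }


def same_fault_site(a, b):
    """Compare everything except the per-build symbol name."""
    keys = ("offset", "size", "module")
    return all(a[k] == b[k] for k in keys)


if __name__ == "__main__":
    rip_590 = parse_rip("RIP: 0010:_nv000778kms+0x4/0x10 [nvidia_modeset]")
    rip_580 = parse_rip("RIP: 0010:_nv000899kms+0x4/0x10 [nvidia_modeset]")
    print(same_fault_site(rip_590, rip_580))
```

For the two lines quoted in this report, both faults sit at offset 0x4 of a 0x10-byte function in nvidia_modeset, which is consistent with the same small accessor being handed a bad pointer in both driver versions.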

Configuration Options Tested (none helped)

  • hardware.nvidia.open = true/false

  • hardware.nvidia.powerManagement.enable = true/false

  • hardware.nvidia.nvidiaPersistenced = true

  • hardware.nvidia.forceFullCompositionPipeline = true

  • boot.kernelParams = ["pcie_aspm=off" "pcie_port_pm=off"]

  • services.hardware.bolt.enable = true

  • Loading nvidia modules in initrd

  • dGPU-only mode (no hybrid), with and without discrete graphics mode.

  • Various NVreg_* module parameters

dmesg Context (events leading to crash)

pcieport 0000:05:04.0: pciehp: Slot(4): Link Down
pcieport 0000:05:04.0: pciehp: Slot(4): Card not present
xhci_hcd 0000:2f:00.0: remove, state 1
usb usb6: USB disconnect, device number 1
[... USB teardown ...]
pci_bus 0000:2e: busn_res: [bus 2e-51] is released
BUG: kernel NULL pointer dereference, address: 0000000000000409

nvidia-bug-report.sh

nvidia-bug-report_before_freeze.log.gz (476.3 KB)

It is not possible to run sudo nvidia-bug-report.sh after it freezes, not even through ssh.

Workaround

None; the only way to recover is a hard reboot.

From the README:

system stability when an eGPU is unplugged while in use (also known as “hot-unplug”) is not guaranteed.

The fact that it worked when 1 monitor was connected is just a lucky coincidence.
See the Debian wiki on how to safely hot-unplug.

Of course it would be great if NV could make unplugging less dangerous, but given how criminally understaffed the Desktop Linux team is, there is little chance of that :(

Thanks for the tip. Disabling the outputs before unplugging is a nice workaround for me.
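For reference, the "disable the outputs before unplugging" workaround could be scripted roughly as follows. This is a minimal sketch, assuming sway's swaymsg is on PATH and that the internal panel's name starts with eDP (it is eDP-1 on this machine). The function names are made up for illustration.

```python
import json
import subprocess


def disable_commands(outputs, internal_prefix="eDP"):
    """Build one 'output <name> disable' command per active external output."""
    return [
        f"output {o['name']} disable"
        for o in outputs
        if o.get("active") and not o["name"].startswith(internal_prefix)
    ]


def disable_external_outputs():
    """Query sway for its outputs and disable every active external one."""
    raw = subprocess.run(
        ["swaymsg", "-t", "get_outputs", "-r"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in disable_commands(json.loads(raw)):
        subprocess.run(["swaymsg", command], check=True)


if __name__ == "__main__":
    disable_external_outputs()
```

Run it just before pulling the dock cable; the outputs can be re-enabled with the matching "output <name> enable" command after the dock is plugged back in.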

I don't know if it matters, but this is a dGPU, not an eGPU.

I poked around some more and found that running:

echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/unbind

also crashes the system. I don't think it is the same crash, but I'm not sure. No idea if this reproduces on other machines.

LOLz, from your 1st post I got the impression that “Lenovo ThinkPad Thunderbolt 4 Dock (40B0)” is an eGPU dock :)))) My bad, I’m biased towards eGPUs ;-)

So in the case where only displays are connected via the dock (and the GPU stays connected to your mobo), you SHOULD indeed be able to disconnect it without any preparation. Sorry again for the confusion.

As explained in the wiki, before unbinding you must ensure that no processes are running on the given GPU (i.e. nvidia-smi must say “No running processes found”). Otherwise it will inevitably crash at least those processes (and currently, unfortunately, the kernel module as well).
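As a complement to the nvidia-smi check, one can also look for processes that still hold a GPU device node open. The following is a minimal sketch, assuming the usual /dev/nvidia* node names and a Linux /proc. The function name is hypothetical, and it needs root to see other users' processes.

```python
import os


def pids_holding(prefixes=("/dev/nvidia",), proc_root="/proc"):
    """Return sorted PIDs with an open fd whose target starts with a prefix."""
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed while scanning
            if target.startswith(tuple(prefixes)):
                holders.add(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__":
    pids = pids_holding()
    print("GPU device nodes still open by:", pids or "nobody")
```

An empty list is a necessary condition at best. A compositor may also hold the GPU through a /dev/dri node, which can be checked by passing the relevant path in prefixes.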

@okok17 marking it as solved was probably not the right move, as now no NV eng will probably ever look at this, and there is a bug here that should be fixed (disconnecting a dock with several displays should work fine).

I should also note that unbinding a dGPU rarely makes sense. The only scenario I can think of is if you want to pass it through to a guest virtual machine using VFIO.

Hi all,

Thank you for reporting the issue. We are tracking this internally as Bug #5871511. We will try to reproduce the issue for investigation. Please let us know if a newer driver resolves this issue.

Did this start to fail with a particular NVIDIA driver version?

Hi @okok17,

On top of the ask from @abchauhan, could we reproduce the issue with the following patch built into the open-gpu-kernel-modules for 590.48.01?

With these patches you will need to set the nvidia_modeset.debug=1 parameter for the debug logs to be generated. After reproducing the issue, we will need to capture the logs with the related prints using sudo nvidia-bug-report.sh after the crash.
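Before reproducing, it may be worth confirming that the parameter actually reached the kernel. The following is a minimal sketch, assuming nvidia_modeset.debug=1 was passed on the kernel command line (for example through boot.kernelParams) rather than through modprobe options, which this check would not see. The helper name is illustrative.

```python
def has_kernel_param(cmdline, name, value):
    """Return True if 'name=value' appears as a whole token in cmdline."""
    return f"{name}={value}" in cmdline.split()


if __name__ == "__main__":
    with open("/proc/cmdline") as fh:
        cmdline = fh.read()
    if has_kernel_param(cmdline, "nvidia_modeset.debug", "1"):
        print("nvidia_modeset.debug=1 is set on the kernel command line")
    else:
        print("nvidia_modeset.debug=1 not found in /proc/cmdline")
```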
