I sent the following to NVIDIA and Sonnet support, but I'm posting it here too in case anyone else is running into similar problems with the RTX 5080 on Linux using the official drivers from the CUDA rhel10 repo; I saw the same behavior on both the 580 and 590 driver branches.
Summary
An RTX 5080 connected via a Thunderbolt 5 eGPU enclosure works at idle (nvidia-smi is functional), but any CUDA operation causes an immediate system hard-lock requiring a power cycle. This appears related to GitHub open-gpu-kernel-modules issue #900 (Blackwell GPU over external PCIe).
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/900
Hardware
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5080 (GB203) |
| eGPU Enclosure | Sonnet Breakaway Box 850T5 (Thunderbolt 5) |
| Host | Lenovo ThinkPad X1 Carbon Gen 11 |
| CPU | Intel Core i7-1355U |
| BIOS | N3XET62W (1.37) |
| Thunderbolt Controller | Intel Raptor Lake-P Thunderbolt 4 |
| OS | Rocky Linux 10.1 Workstation (clean install) |
| Kernel | 6.12.0-124.13.1.el10_1.x86_64 (PREEMPT_DYNAMIC) |
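For anyone trying to compare setups: the Thunderbolt tunnel and GPU enumeration can be sanity-checked before the driver ever loads, assuming the bolt and pciutils packages are installed:

```
# Confirm the enclosure is authorized over Thunderbolt
boltctl list

# Confirm the GPU enumerated on the PCIe bus (0x10de = NVIDIA vendor ID)
lspci -d 10de: -nn
```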
Driver
- Version: 590.44.01
- Source: Official CUDA RHEL10 repository
- Type: Open kernel modules (kmod-nvidia-open-dkms)
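A quick way to confirm it is the open kernel modules (rather than the proprietary ones) that actually loaded:

```
# The open modules should report license "Dual MIT/GPL"
modinfo nvidia | grep -E '^(version|license):'

# Driver version as seen by the running kernel
cat /proc/driver/nvidia/version
```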
PCIe Link Status
```
LnkCap: Port #0, Speed 32GT/s, Width x16
LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)
```

Thunderbolt link: 40 Gb/s (2 lanes × 20 Gb/s)
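The link values above come from lspci; to check the same thing on another setup (the PCI address below is a placeholder and will differ per system):

```
# 0000:3f:00.0 is a placeholder - take the GPU's address from lspci
sudo lspci -vv -s 0000:3f:00.0 | grep -E 'LnkCap:|LnkSta:'
```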
Symptoms
- GPU detected on PCIe bus at boot
- nvidia-smi reports the GPU correctly and shows idle state (2 W, 30 °C)
- Any CUDA operation causes immediate system hard-lock
Minimal Reproducer
```
# Works - GPU visible and responsive at idle
nvidia-smi

# Hard lock - system freezes immediately, requires power cycle
python3 -c "import torch; x = torch.zeros(1, device='cuda'); print(x)"
```
The system freezes completely: no kernel panic, no Xid error logged, no response to SysRq. Recovery requires a power cycle.
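Because nothing survives on disk, the only realistic way to catch a message is to stream the kernel log to a second machine before triggering the hang; a sketch, assuming SSH access (the hostname is a placeholder) and persistent journaling:

```
# From a second machine: stream kernel messages while the reproducer runs
ssh user@x1-carbon 'sudo dmesg --follow'

# On the laptop, before triggering: flush the journal to disk
sudo journalctl --flush

# After the forced reboot: inspect the previous boot's kernel log
sudo journalctl -k -b -1
```

If the hang is at the bus level, even this may show nothing after the last command.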
Required Configuration
Kernel Parameters
```
pcie_aspm=off
pcie_ports=native
pcie_port_pm=off
intel_iommu=off
pci=assign-busses,realloc,hpbussize=0x33,hpmmiosize=768M,hpmmioprefsize=16G
rd.driver.blacklist=nouveau
rd.driver.blacklist=nova-core
```
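On Rocky/RHEL these can be applied persistently with grubby (the string is exactly the parameter list above):

```
# Append the parameters to every installed kernel's boot entry
sudo grubby --update-kernel=ALL --args="pcie_aspm=off pcie_ports=native pcie_port_pm=off intel_iommu=off pci=assign-busses,realloc,hpbussize=0x33,hpmmiosize=768M,hpmmioprefsize=16G rd.driver.blacklist=nouveau rd.driver.blacklist=nova-core"

# Verify after reboot
cat /proc/cmdline
```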
BIOS Settings
- Kernel DMA Protection: Disabled (required; with it enabled, BARs fail to allocate)
- Thunderbolt PCIe Tunneling: Enabled
- Secure Boot: Disabled
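The DMA protection state can be double-checked from Linux without rebooting into the BIOS, assuming the standard Thunderbolt sysfs layout:

```
# 1 = Kernel DMA Protection active, 0 = disabled
cat /sys/bus/thunderbolt/devices/domain0/iommu_dma_protection
```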
Modprobe Configuration
/etc/modprobe.d/nvidia-pm.conf:
```
options nvidia NVreg_DynamicPowerManagement=0x00
```
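If the nvidia module is packed into the initramfs (an assumption that depends on how dracut is configured), the option only takes effect after regenerating it; either way, the live value can be read back from the driver:

```
# Rebuild the initramfs so the modprobe option is present early in boot
sudo dracut -f

# After the module loads, read back the active registry parameters
# (parameter names appear without the NVreg_ prefix)
grep -i DynamicPowerManagement /proc/driver/nvidia/params
```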
Udev Rules
/etc/udev/rules.d/99-nvidia-no-d3cold.rules:
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", ATTR{power/control}="on", ATTR{d3cold_allowed}="0"
Issues Encountered During Debugging
| Issue | Details |
|---|---|
| Without pcie_ports=native | GPU enters D3cold; driver fails with "Unable to change power state from D3cold to D0" |
| With Kernel DMA Protection enabled | PCIe tunnel limited to 2.5GT/s x4; BAR allocation fails |
| BAR allocation | Requires hotplug resource reservation parameters |
| Driver probe | GPU periodically shows "fallen off the bus" during probe attempts |
dmesg at Boot (Successful Driver Load)
```
nvidia: loading out-of-tree module taints kernel.
nvidia-nvlink: Nvlink Core is being initialized, major device number 511
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 590.44.01
```
Relation to Issue #900
Issue #900 documents identical symptoms with an RTX 5090 over OCuLink (external PCIe):

- nvidia-smi works at idle
- Computational load causes the GPU to disconnect or the system to crash
- GSP firmware bootstrap errors noted during driver loading
Both involve Blackwell GPUs behind non-native PCIe connections (Thunderbolt in my case, OCuLink in #900), which appears to be the common factor.
Attachment
nvidia-bug-report.log.gz (1.4 MB)