590 release feedback & discussion

I sent the following to NVIDIA and Sonnet support, but I'm posting it here as well in case anyone else is running into similar problems with the RTX 5080 on Linux using the official drivers from the CUDA RHEL 10 repository (both 580 and 590).

Summary

An RTX 5080 connected through a Thunderbolt 5 eGPU enclosure works at idle (nvidia-smi is functional), but any CUDA operation causes an immediate system hard-lock that requires a power cycle. This appears related to open-gpu-kernel-modules issue #900 on GitHub (Blackwell GPU over external PCIe).

https://github.com/NVIDIA/open-gpu-kernel-modules/issues/900

Hardware

Component               Details
GPU                     NVIDIA GeForce RTX 5080 (GB203)
eGPU Enclosure          Sonnet Breakaway Box 850T5 (Thunderbolt 5)
Host                    Lenovo ThinkPad X1 Carbon Gen 11
CPU                     Intel Core i7-1355U
BIOS                    N3XET62W (1.37)
Thunderbolt Controller  Intel Raptor Lake-P Thunderbolt 4
OS                      Rocky Linux 10.1 Workstation (clean install)
Kernel                  6.12.0-124.13.1.el10_1.x86_64 (PREEMPT_DYNAMIC)

Driver

  • Version: 590.44.01

  • Source: Official CUDA RHEL10 repository

  • Type: Open kernel modules (kmod-nvidia-open-dkms)
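
To double-check that the DKMS-built open module is the one actually loaded (and that nouveau stayed blacklisted), a quick sketch; dkms and modinfo are assumed to be available, and the expected version string is simply the one listed above:

# Rough check that the open kernel module build is active
modinfo -F version nvidia         # expect 590.44.01
dkms status                       # the nvidia/590.44.01 module should show as installed
lsmod | grep -E 'nvidia|nouveau'  # nvidia modules loaded, nouveau absent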

PCIe Link Status

LnkCap: Port #0, Speed 32GT/s, Width x16
LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)

Thunderbolt link: 40 Gb/s (2 lanes × 20 Gb/s)
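
For reference, the link readings above can be re-checked like this; lspci is filtered by NVIDIA's vendor ID (10de), and boltctl is only available if the bolt package is installed:

# Negotiated PCIe link for the GPU (LnkCap = capability, LnkSta = current state)
sudo lspci -vv -d 10de: | grep -E 'LnkCap:|LnkSta:'

# Thunderbolt tunnel state and authorization of the enclosure
boltctl list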

Symptoms

  1. GPU detected on PCIe bus at boot

  2. nvidia-smi reports GPU correctly and shows idle state (2W, 30°C)

  3. Any CUDA operation causes immediate system hard-lock

Minimal Reproducer

# Works - GPU visible and responsive at idle
nvidia-smi

# Hard lock - system freezes immediately, requires power cycle
python3 -c "import torch; x = torch.zeros(1, device='cuda'); print(x)"

The system freezes completely: no kernel panic, no Xid error logged, no SysRq response. A power cycle is required to recover.
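
To rule out PyTorch itself, the raw CUDA driver API can be poked directly to see which step hangs (driver init vs. device enumeration vs. the first real device work). This is only a bisection sketch; it assumes libcuda.so.1 is on the default library path:

# Step 1: driver init only - no context, no device memory traffic
python3 -c "import ctypes; cuda = ctypes.CDLL('libcuda.so.1'); print('cuInit:', cuda.cuInit(0))"

# Step 2: device enumeration - still no allocation
python3 -c "import ctypes; cuda = ctypes.CDLL('libcuda.so.1'); cuda.cuInit(0); n = ctypes.c_int(0); cuda.cuDeviceGetCount(ctypes.byref(n)); print('devices:', n.value)"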

Required Configuration

Kernel Parameters

pcie_aspm=off
pcie_ports=native
pcie_port_pm=off
intel_iommu=off
pci=assign-busses,realloc,hpbussize=0x33,hpmmiosize=768M,hpmmioprefsize=16G
rd.driver.blacklist=nouveau
rd.driver.blacklist=nova-core
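
On Rocky/RHEL 10, one way to apply these is with grubby rather than editing the GRUB config by hand; this is a sketch, and the comma-separated rd.driver.blacklist form is dracut's equivalent of the two separate entries above:

# Apply the parameters to all installed kernels
sudo grubby --update-kernel=ALL --args="pcie_aspm=off pcie_ports=native pcie_port_pm=off intel_iommu=off pci=assign-busses,realloc,hpbussize=0x33,hpmmiosize=768M,hpmmioprefsize=16G rd.driver.blacklist=nouveau,nova-core"

# After a reboot, confirm they are active
cat /proc/cmdline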

BIOS Settings

  • Kernel DMA Protection: Disabled (required; with it enabled, BARs fail to allocate)

  • Thunderbolt PCIe Tunneling: Enabled

  • Secure Boot: Disabled
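
Whether these firmware settings actually took effect can be checked from the running system; a sketch, where domain0 is assumed to be the first Thunderbolt domain:

mokutil --sb-state                                             # Secure Boot should report disabled
cat /sys/bus/thunderbolt/devices/domain0/iommu_dma_protection  # expect 0 with the settings above
cat /sys/bus/thunderbolt/devices/domain0/security              # Thunderbolt security level of the tunnel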

Modprobe Configuration

/etc/modprobe.d/nvidia-pm.conf:

options nvidia NVreg_DynamicPowerManagement=0x00
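
Since the nvidia module can end up in the initramfs on RHEL-family installs, the initramfs may need rebuilding so the option applies at early boot; the active value can then be read back from the driver's procfs node (a sketch):

sudo dracut -f                                            # rebuild the initramfs for the running kernel
grep DynamicPowerManagement /proc/driver/nvidia/params    # should report 0 after reboot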

Udev Rules

/etc/udev/rules.d/99-nvidia-no-d3cold.rules:

ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", ATTR{power/control}="on", ATTR{d3cold_allowed}="0"

Issues Encountered During Debugging

  • Without pcie_ports=native: the GPU enters D3cold and the driver fails with "Unable to change power state from D3cold to D0"

  • With Kernel DMA Protection enabled: the PCIe tunnel is limited to 2.5 GT/s x4 and BAR allocation fails

  • BAR allocation: requires the hotplug resource reservation parameters (the pci= options above)

  • Driver probe: the GPU periodically shows "fallen off the bus" during probe attempts
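
When re-testing, these failure signatures can be pulled from the logs in one pass; a rough filter (the journalctl line only works if persistent journaling is enabled):

sudo dmesg -T | grep -Ei 'NVRM|Xid|fallen off the bus|D3cold|BAR'
sudo journalctl -k -b -1 | grep -Ei 'NVRM|Xid|fallen off the bus'   # kernel log from the previous (crashed) boot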

dmesg at Boot (Successful Driver Load)

nvidia: loading out-of-tree module taints kernel.
nvidia-nvlink: Nvlink Core is being initialized, major device number 511
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 590.44.01

Relation to Issue #900

Issue #900 documents identical symptoms with RTX 5090 over OCuLink (external PCIe):

  • nvidia-smi works at idle

  • Computational load causes GPU to disconnect/system to crash

  • GSP firmware bootstrap errors noted during driver loading

Both cases involve Blackwell GPUs over external PCIe interfaces (Thunderbolt here, OCuLink in #900), so the common factor appears to be the Blackwell architecture running over a non-native PCIe connection.

Attachment

nvidia-bug-report.log.gz (1.4 MB)
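
For anyone collecting the same data to compare, the driver ships a script that produces an equivalent bundle:

sudo nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz to the current directory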
