I’ve been using an eGPU setup for a couple of years now (RTX 3090 in a Razer Core X Chroma over Thunderbolt) and only recently started getting random GPU disconnects mid-workload.
The failure is intermittent: sometimes after ~30 minutes, sometimes only after ~20 hours. It can happen under low VRAM usage too (not only high load).
Specs:
Laptop: Dell XPS 13 9340
OS: Ubuntu 22.04.5 LTS
Kernel: 6.5.0-45-generic
eGPU enclosure: Razer Core X Chroma (TB3)
GPU: NVIDIA RTX 3090
Driver stack currently installed: 580.126.09 (nvidia-driver-580-open)
Recent changes I can recall from shortly before the issue started:
NVIDIA driver update (via unattended-upgrades):
580.95.05 → 580.126.09
Dell BIOS update:
1.21.0 → 1.23.0
I also cleaned laptop vents externally with compressed air (see below why I mention this).
Mid-workload the GPU apparently falls off the bus and I get “no devices were found”. The kernel logs repeatedly show this sequence:
Many corrected AER Data Link errors (BadDLLP) on pcieport 0000:02:01.0 (Intel JHL6540 TB3 bridge)
Then fatal AER (DLP) and link reset/recovery failure
Then NVIDIA:
Xid 79, GPU has fallen off the bus
Xid 154, recovery action … Node Reboot Required
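To see how often each stage of this sequence occurs in a saved kernel log, a rough sketch like the one below can tally the events. The regex patterns are assumptions based on the messages quoted above (BadDLLP, fatal AER, Xid numbers); the exact log wording can differ between kernel and driver versions, and the file name `kernel.log` is only a placeholder.

```python
import re
import sys
from collections import Counter

# Assumed patterns, modelled on the messages quoted in this post.
# Adjust them if your kernel/driver words these lines differently.
BAD_DLLP = re.compile(r"\bBadDLLP\b")
AER_FATAL = re.compile(r"severity=Uncorrected \(Fatal\)")
XID = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def summarize(lines):
    """Count BadDLLP lines, fatal AER lines and NVIDIA Xid codes."""
    counts = Counter()
    for line in lines:
        if BAD_DLLP.search(line):
            counts["BadDLLP"] += 1
        if AER_FATAL.search(line):
            counts["AER fatal"] += 1
        match = XID.search(line)
        if match:
            counts[f"Xid {match.group(1)}"] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_script.py kernel.log
    with open(sys.argv[1], errors="replace") as log:
        for event, n in summarize(log).most_common():
            print(f"{n:6d}  {event}")
```

The input can be produced with `journalctl -k -b -1 > kernel.log` (kernel messages from the previous boot). A large BadDLLP count followed by a single fatal error and Xid 79 would match the pattern described above.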
I attached nvidia-bug-report logs (baseline and after the drop). Could this be related to the physical TB connection?
Has anyone seen this exact TB/eGPU pattern on 580.126.09 (open driver) on Ubuntu 22.04, and is there a recommended driver branch/workaround to test first?
P.S.
I have to use the open drivers because I have a second eGPU setup with a 5090, and AFAIK the Blackwell architecture doesn’t work without the open drivers.
If it’s not the cable, then it could be any other component with roughly equal probability, which makes it very hard to tell.
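One way to compare components when swapping cables or ports might be to watch the bridge's corrected-error counters, which Linux exposes per device in sysfs as `aer_dev_correctable`. The sketch below is only an illustration: the default address `0000:02:01.0` is the bridge named in the original post, and the helper names are made up for this example.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'NAME COUNT' lines (the aer_dev_correctable layout) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_correctable(bdf="0000:02:01.0"):
    """Read the corrected-error counters for one PCI device.

    The default address is the TB3 bridge from the original post;
    it will differ on other machines.
    """
    path = Path("/sys/bus/pci/devices") / bdf / "aer_dev_correctable"
    return parse_aer_stats(path.read_text())


def delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }
```

Taking a snapshot with `read_correctable()` before a workload and another one afterwards, then passing both to `delta()`, shows whether BadDLLP keeps climbing with a given cable or port.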
It’s probably not a software issue, because that would affect many people at roughly the same time, and there was no sudden wave of failure reports on egpu.io etc. The only software component that might be involved, IMO, is the Dell firmware upgrade, since that could affect only your model, so you could check whether downgrading helps.
Were you able to find the root cause of your issue?
I am using the Razer Core X with a 5060 Ti and Ubuntu 24.04 on a Minisforum UM790Pro and am getting exactly the same AER issues in my journalctl. I am also using the 580-open driver.
Sometimes the PC crashes. This can happen after 2 minutes, and sometimes only after hours.
I am very interested in your feedback: what was your fix, or was the Razer Core X simply defective?
Blackwell is known to be extremely unstable as a TB eGPU: see the egpu.io forum for dozens of similar problem reports. This is most probably due to its low tolerance for PCIe signal latency.