RTX 3090 eGPU over TB3 randomly falls off bus on Ubuntu 22.04.5 / Dell XPS 13 9340

I’ve been using an eGPU setup for a couple of years now (RTX 3090 in a Razer Core X Chroma over Thunderbolt) and only recently started getting random GPU disconnects mid-workload.

The failure is intermittent: sometimes after ~30 minutes, sometimes only after ~20 hours. It can happen under low VRAM usage too (not only high load).

Specs:

  • Laptop: Dell XPS 13 9340
  • OS: Ubuntu 22.04.5 LTS
  • Kernel: 6.5.0-45-generic
  • eGPU enclosure: Razer Core X Chroma (TB3)
  • GPU: NVIDIA RTX 3090
  • Driver stack currently installed: 580.126.09 (nvidia-driver-580-open)

Recent changes I can recall from just before the issue started:

  1. NVIDIA driver update (via unattended-upgrades):
    • 580.95.05 → 580.126.09
  2. Dell BIOS update:
    • 1.21.0 → 1.23.0
  3. I also cleaned laptop vents externally with compressed air (see below why I mention this).

Mid-workload the GPU apparently falls off the bus and I get “no devices were found”. Kernel logs repeatedly show this sequence:

  1. Many corrected AER Data Link errors (BadDLLP) on pcieport 0000:02:01.0 (Intel JHL6540 TB3 bridge)
  2. Then fatal AER (DLP) and link reset/recovery failure
  3. Then NVIDIA:
    • Xid 79, GPU has fallen off the bus
    • Xid 154, recovery action … Node Reboot Required
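To check whether a given boot shows the same three-stage sequence, a small script can tally the relevant kernel messages per PCI device. The following is only a minimal sketch: the regular expressions are assumptions modelled on the message formats quoted in this thread (plus the usual `NVRM: Xid (PCI:...): <code>,` layout), and the script name and log file name in the usage note are hypothetical.

```python
import re
import sys
from collections import Counter

# PCI address in domain:bus:device.function form, e.g. 0000:02:01.0
BDF = r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"

# Assumed message shapes; adjust them if your kernel words these differently.
CORRECTED = re.compile(BDF + r": PCIe Bus Error: severity=Correct\w+")
FATAL = re.compile(BDF + r": PCIe Bus Error: severity=Uncorrect\w+ \(Fatal\)")
XID = re.compile(r"NVRM: Xid \(PCI:[^)]*\): (\d+),")


def summarize(lines):
    """Count corrected AER errors, fatal AER errors, and NVIDIA Xid codes."""
    corrected, fatal, xids = Counter(), Counter(), Counter()
    for line in lines:
        if (m := CORRECTED.search(line)):
            corrected[m.group(1)] += 1      # keyed by reporting device
        elif (m := FATAL.search(line)):
            fatal[m.group(1)] += 1          # keyed by reporting device
        elif (m := XID.search(line)):
            xids[int(m.group(1))] += 1      # keyed by Xid code, e.g. 79
    return corrected, fatal, xids


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects a text file containing saved kernel log output.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        corrected, fatal, xids = summarize(fh)
    print("corrected AER by device:", dict(corrected))
    print("fatal AER by device:", dict(fatal))
    print("Xid codes:", dict(xids))
```

One way to use it: save the kernel log of the boot in which the drop happened with `journalctl -k -b -1 --no-pager > kernel.log`, then run `python3 aer_summary.py kernel.log`. A large corrected count on the TB bridge followed by a fatal entry and Xid 79 would match the pattern described above.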

I attached nvidia-bug-report logs (baseline and after drop). It appears to have something to do with the physical TB connection?

Has anyone seen this exact TB/eGPU pattern on 580.126.09 (open driver) on Ubuntu 22.04, and is there a recommended driver branch/workaround to test first?

P.S.

I have to use the open drivers because I have a second eGPU setup with a 5090, and AFAIK the Blackwell architecture doesn’t work without them.

nvidia-bug-report.baseline.log (5.2 MB)

nvidia-bug-report.after-drop.log (5.6 MB)

There is a pretty good chance that your TB cable is dying after a few years: have you tried another?

Thanks for the follow-up.

Yep, cable is my first test. If it still persists, do you see another likely root cause (e.g., laptop TB controller/retimer path or ports)?

If it’s not the cable, then it may be literally any other component with generally similar probability: very hard to tell.

It’s probably not a software issue, because that would affect many people at roughly the same time, and there has been no sudden wave of failure reports on egpu.io etc. The only software component that may be involved, IMO, is the Dell firmware upgrade, because that could affect only your model, so you may want to check whether downgrading it helps.

Hey @georg9alem,

Were you able to find the root cause of your issue?

I am using a Razer Core X with a 5060 Ti and Ubuntu 24.04 on a Minisforum UM790Pro, and I am getting exactly the same AER issues in my journalctl output. I am also using the 580-open driver.

Sometimes the PC crashes. This can happen after 2 minutes, and sometimes only after hours.

I am very interested in your feedback: what fixed it for you, or was the Razer Core X simply defective?

I would be very happy to get a reply :)

Thanks in advance!

You may want to check out the release notes on the latest driver, 595.45.04: https://www.nvidia.com/en-us/drivers/details/265870/
It sounds like the issue may be fixed?

If your issue still happens on 595.45.04, you’ll want to attach a bug report file for NVIDIA to review.

Hey @catt, and thanks for your fast reply.
I’ve installed the latest driver, 595.58.03, from the NVIDIA repos, but the error still occurs.

So the AER BadDLLP issue persists.

Thanks for your help anyway!

Mär 31 12:23:14 brun-ki01 kernel: pcieport 0000:00:04.1: AER: Correctable error message received from 0000:65:01.0

Mär 31 12:23:14 brun-ki01 kernel: pcieport 0000:65:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)

Mär 31 12:23:14 brun-ki01 kernel: pcieport 0000:65:01.0: device [8086:15da] error status/mask=00000080/00002000

Mär 31 12:23:14 brun-ki01 kernel: pcieport 0000:65:01.0: [ 7] BadDLLP
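For anyone reading these lines: the `error status/mask` pair is a set of bit flags, and the `[ 7] BadDLLP` line is simply bit 7 of the status word spelled out. The sketch below decodes both words. The bit names follow the short labels the Linux AER driver prints for correctable errors, so the table is only meaningful for lines reported with `severity=Correctable`; the function name `decode` is hypothetical.

```python
import re

# Short labels for the correctable error status bits, as printed by the
# Linux AER driver. Only valid for severity=Correctable lines.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

STATUS_MASK = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode(line):
    """Return (status_names, mask_names) for an AER status/mask log line."""
    m = STATUS_MASK.search(line)
    if m is None:
        return None
    status, mask = (int(group, 16) for group in m.groups())

    def names(value):
        # Unknown bits are reported by number rather than guessed at.
        return [CORRECTABLE_BITS.get(bit, f"bit{bit}")
                for bit in range(32) if (value >> bit) & 1]

    return names(status), names(mask)


print(decode("device [8086:15da] error status/mask=00000080/00002000"))
# → (['BadDLLP'], ['NonFatalErr'])
```

So in the excerpt above, the only error actually being reported is BadDLLP; the mask word just shows that advisory non-fatal errors are masked on that port.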

Blackwell is known to be extremely unstable as a TB eGPU: see the egpu.io forum for dozens of similar problem reports. This is most probably due to its low tolerance for PCIe signal latency.