Crashing eGPU

Hi,

I added an eGPU to my Linux host for some AI/LLM applications. I would also like to use it for ffmpeg.

For the past few days, however, I have not been able to get a stable setup with an RTX 5060 Ti, which I have connected via OCuLink through a free NVMe slot and an ADT-Link F9G adapter (PCIe 4.0 x4). Other people seem happy with such setups.

I believe I have tried all the available open drivers (575, 570…), both on the host and inside VMs, as well as the recent 576.52 driver for Windows 11 inside a VM.

Currently, I have:

  • the eGPU setup, with a 650 W PSU
  • a PC/host with a Ryzen 9 7945HX (16C/32T) and 96 GB of RAM
  • working KVM and PCI passthrough for the VMs
  • a working Debian 12 Linux VM
  • an up-to-date Windows 11 VM

For other uses, without the eGPU, the PC and the VMs run fine.

Inside the Debian 12 VM, after startup, I can successfully run gpu_burn for an hour. For such an initial run, the nvidia-smi and nvtop outputs look fine. But if I repeat the burn test, I quickly get a “GSP timeout” crash or similar, and then I need to cold-start the PC.
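When the GSP timeout hits, the kernel log fills with NVRM/Xid messages, and those (plus any PCIe AER lines) narrow down the failure class. This is the kind of filter I use to spot them; the sample log lines below are made up for illustration (Xid 79, “GPU has fallen off the bus”, is the class typical of a flaky link):

```shell
# Pull NVIDIA (NVRM/Xid) and PCIe AER lines out of the kernel log.
# On the live box you would pipe in `dmesg` or `journalctl -k`;
# the sample below is fabricated for illustration only.
sample='[ 1234.5] NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.
[ 1234.6] pcieport 0000:00:01.1: AER: Corrected error received: 0000:03:00.0
[ 1235.0] usb 1-2: new full-speed USB device'
printf '%s\n' "$sample" | grep -E 'NVRM|Xid|AER'
```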

ollama (latest version) can usually, but only usually, load a small LLM into GPU RAM; then it quickly fails or hangs as well.

Some CUDA samples I had used, such as the PCIe bandwidth tests, crash as well.

In the Windows 11 VM, after startup, the driver status and the nvidia-smi output also look fine, with some Windows processes loaded and running on the eGPU. But as soon as I use that VM a little (Firefox via the VM console, or RDP for a remote connection), I get a blue screen at the console within seconds. Running ollama in that VM kills it too.

So far, I couldn’t figure out what could be wrong with the software (drivers, kernels, …) I’m using, or whether I have a hardware issue with the eGPU setup. I don’t have another PC to test the 5060 Ti in.

I’ve now ordered another OCuLink cable and NVMe adapter to rule out any hardware/cable/connection issue on that side.

What else could be done to diagnose or fix such a crashing setup?

Best regards

PS: I’m not sharing any debug output for the time being, as it would be useless if the crashes turn out to be hardware related.

This is from inside the Win 11 VM, which works fine until I start some graphics-stressing applications:

>nvidia-smi
Mon Jun 16 20:07:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.52                 Driver Version: 576.52         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti   WDDM  |   00000000:05:00.0 Off |                  N/A |
|  0%   36C    P8              4W /  180W |      84MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1052    C+G   ...yb3d8bbwe\Notepad\Notepad.exe      N/A      |
|    0   N/A  N/A            5864    C+G   ...yb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A            7072    C+G   C:\Windows\explorer.exe               N/A      |
|    0   N/A  N/A            7924    C+G   ...y\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A            7948    C+G   ..._cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A            8044    C+G   ...cw5n1h2txyewy\WidgetBoard.exe      N/A      |
+-----------------------------------------------------------------------------------------+


> nvidia-smi pmon
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command
# Idx           #    C/G      %      %      %      %      %      %    name
    0       1052   C+G      -      -      -      -      -      -    Notepad.exe
    0       5864   C+G      -      -      -      -      -      -    WindowsTerminal.
    0       7072   C+G      -      -      -      -      -      -    explorer.exe
    0       7924   C+G      -      -      -      -      -      -    StartMenuExperie
    0       7948   C+G      -      -      -      -      -      -    SearchHost.exe
    0       8044   C+G      -      -      -      -      -      -    WidgetBoard.exe
    0       1052   C+G      -      -      -      -      -      -    Notepad.exe
    0       5864   C+G      -      -      -      -      -      -    WindowsTerminal.
    0       7072   C+G      -      -      -      -      -      -    explorer.exe
    0       7924   C+G      -      -      -      -      -      -    StartMenuExperie
    0       7948   C+G      -      -      -      -      -      -    SearchHost.exe
    0       8044   C+G      -      -      -      -      -      -    WidgetBoard.exe
    0       1052   C+G      -      -      -      -      -      -    Notepad.exe


> nvidia-smi pci -gCnt
GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-515f7ee1-338e-94e0-ab03-3c881720a7e2)
    TX_BYTES:                430684048
    RX_BYTES:                290999516


> nvidia-smi pci -gErrCnt
GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-515f7ee1-338e-94e0-ab03-3c881720a7e2)
    REPLAY_COUNTER:          0
    REPLAY_ROLLOVER_COUNTER: 0
    L0_TO_RECOVERY_COUNTER:  0
    CORRECTABLE_ERRORS:      0
    NAKS_RECEIVED:           0
    RECEIVER_ERROR:          0
    BAD_TLP:                 0
    NAKS_SENT:               0
    BAD_DLLP:                0
    NON_FATAL_ERROR:         0
    FATAL_ERROR:             0
    UNSUPPORTED_REQ:         0
    LCRC_ERROR:              0
    LANE_ERROR:
         lane  0: 0
         lane  1: 0
         lane  2: 0
         lane  3: 0
         lane  4: 0
         lane  5: 0
         lane  6: 0
         lane  7: 0
         lane  8: 0
         lane  9: 0
         lane 10: 0
         lane 11: 0
         lane 12: 0

I’ll try with different hardware, different cabling, and another dock as well.

I’ve now put a 575.57.08 driver back on the host/PC, which runs much better that way… though the AI ends up speaking out weirdly (but so far without a PC crash, which will probably follow later).

I’ve reached a much more stable state by forcing the PCIe link speed to Gen 1 (max 2.5 GT/s). So I can see my setup/hardware does work in a degraded mode. Of course, the driver and ollama weren’t designed to deal with PCIe errors/retrains…
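For reference, one way to cap the speed on Linux is to write the Target Link Speed field in the Link Control 2 register of the downstream port *above* the GPU, then retrain the link. A sketch below; the port address is only an example (find the real one with `lspci -t`), and DRY_RUN=1 just prints the commands so nothing is written by accident:

```shell
# Cap the link above the GPU at Gen 1 (2.5 GT/s), then retrain.
# PORT must be the downstream bridge port above the GPU, not the GPU
# itself -- 0000:00:01.1 is an example address; check `lspci -t`.
PORT=0000:00:01.1
DRY_RUN=1
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}
# LnkCtl2 sits at offset 0x30 from the PCIe capability; bits 3:0 = target speed.
run setpci -s "$PORT" CAP_EXP+30.w=0001:000f
# LnkCtl sits at offset 0x10; bit 5 (0x20) = Retrain Link.
run setpci -s "$PORT" CAP_EXP+10.w=0020:0020
```

With DRY_RUN=0 (as root), the same two setpci writes take effect on the real port.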

Seen elsewhere: “L0_TO_RECOVERY_COUNTER: times the PCIe link dropped into recovery mode. High values here = serious PCIe link stability problem.”
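To check whether that counter is actually climbing under load (rather than being left over from boot), snapshotting it before and after a run is enough. The parse only needs the text of `nvidia-smi pci -gErrCnt`, so it is shown here on a pasted sample mirroring the output above:

```shell
# Extract L0_TO_RECOVERY_COUNTER from `nvidia-smi pci -gErrCnt` output,
# so before/after snapshots can be diffed around a burn run.
# On the live box: nvidia-smi pci -gErrCnt | awk '/L0_TO_RECOVERY_COUNTER/ {print $2}'
sample='GPU 0: NVIDIA GeForce RTX 5060 Ti
    REPLAY_COUNTER:          0
    L0_TO_RECOVERY_COUNTER:  56
    CORRECTABLE_ERRORS:      0'
count=$(printf '%s\n' "$sample" | awk '/L0_TO_RECOVERY_COUNTER/ {print $2}')
echo "L0->recovery events: $count"
```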

>>> still there?
ehpot ai hhet now fur bot gutedchgua dot-h been in my passed please dhmy oro filled~

# lspci -vvv -s 03:00.0 | grep -i lnksta
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

# nvidia-smi pci -gErrCnt
GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-515f7ee1-338e-94e0-ab03-3c881720a7e2)
    REPLAY_COUNTER:          0
    REPLAY_ROLLOVER_COUNTER: 0
    L0_TO_RECOVERY_COUNTER:  56
    CORRECTABLE_ERRORS:      0
    NAKS_RECEIVED:           0
    RECEIVER_ERROR:          0
    BAD_TLP:                 0
    NAKS_SENT:               0
    BAD_DLLP:                0
    NON_FATAL_ERROR:         0
    FATAL_ERROR:             0
    UNSUPPORTED_REQ:         0
    LCRC_ERROR:              0
    LANE_ERROR:
         lane  0: 0
         lane  1: 0
         lane  2: 0
         lane  3: 0
         lane  4: 0
         lane  5: 0
         lane  6: 0
         lane  7: 0
         lane  8: 0
         lane  9: 0
         lane 10: 0
         lane 11: 0
         lane 12: 0

This adapter does not have redrivers, so whether it will work at all, and how stable the signal will be, is subject to a silicon lottery. There were a lot of people for whom it did not work at all; see the comments on this build on egpu.io for an example.

This means the PCIe signal integrity is still very poor. What adapter are you using now? Does it have redrivers? …or is it still the F9G?
For reference, this post on eGPU.io describes the various “generations” of OCuLink eGPU adapters and how they try to deal with signal integrity. So far the only approach that could be described as “somewhat reliable” is using PCIe redrivers. Currently very few models on the market include redrivers; the most popular are the Minisforum DEG1 and the EXP-GDC OCuP4v2, so if you are using an adapter without redrivers, I’d recommend upgrading to one of these two.
If you are using an adapter that does have redrivers and you are still not able to get a stable signal at speeds close to those described in the perf table (the “-TGX, -OLx4, -XG, -LUA” row), then you may try an M.2-to-OCuLink module that also includes redrivers, such as the ADT-F4Q: have a look at the OCuLink section of the Debian eGPU wiki for details.

PS, completely out of curiosity: what is your host OS? (You described only the guest VMs’ OSes.) Thanks! :)

Thanks for your input. I believed this could work much better, so before posting here I had spent hours on software, testing packages and versions… until I thought of forcing/clocking down the PCIe speed.

The host itself is nothing special: Debian Trixie with kernel 6.14.10 (I needed > 6.13 for another purpose). It should be fully transparent for PCIe passthrough to the VMs. Stock Debian 12 or Ubuntu 24.04 LTS should do the job, either on the host or inside the VMs.

One of the main real software issues I hit was with Windows 11 in an up-to-date VM, where I was getting error code 43 and the Nvidia driver/GPU remained disabled. I had to reinstall a fresh VM, then the Nvidia driver, before updating that VM; that worked, and error 43 was gone. That setup/VM is stable as long as I don’t stress it with graphics (Firefox and YouTube or so), which crashes the VM within seconds on my current hardware.

I’m still using the ADT-Link F9G, but now forced to Gen 1 (2.5 GT/s). That way, my system ends up reporting almost no PCIe errors, and ollama can load/run/unload small models without crashing the box. Once it has crashed, nvidia-smi and nvtop hang, and the bug-report.sh script hangs as well. The dmesg output often looks bad and suggests power-cycling the GPU.
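On the power-cycling part: before a full cold start, it can be worth trying a soft device cycle through sysfs (remove the function, then rescan the bus). It often isn’t enough once the GSP has wedged, since it doesn’t cut power to the card, but it costs nothing to try. A sketch; the bus address matches my lspci output, and DRY_RUN=1 only prints what would be written:

```shell
# Soft-cycle a PCIe device from sysfs (needs root on a live system).
# BDF is the GPU's bus address from lspci; DRY_RUN=1 only prints the
# writes so the sequence can be sanity-checked on any machine.
BDF=0000:03:00.0
DRY_RUN=1
pci_cycle() {
    for node in "devices/$BDF/remove" "rescan"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "would write: echo 1 > /sys/bus/pci/$node"
        else
            echo 1 > "/sys/bus/pci/$node"
        fi
    done
}
pci_cycle
```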

I didn’t test Gen 1 with more intensive graphics loads on Windows; that makes no sense, as it would still end up crashing late and randomly. It looks like the software/drivers don’t deal with PCIe lane errors. In the Windows VM, the CPU itself starts to get loaded, then I get a blue screen… and not many events in the Windows logs.

My current setup isn’t the best, as the issue now seems signal-integrity related at the higher x4 speeds, with maybe still some integrity errors at Gen 1. I have a 50 cm cable run: 20 cm from the NVMe adapter to a panel connector, then a 30 cm cable from the panel to the dock. This is too long, and the panel connectors in the middle surely degrade the signals further.

I’ll have to check and see if I can manage a 30 cm, or shorter 20 cm, cable/link length. Plus, I’ve ordered a Minisforum DEG1. I should notice differences using Windows with graphics (more intensive PCIe loads).

So far, I haven’t found an NVMe adapter with “redrivers” (line buffers/signal reshapers). The ADT-F4Q doesn’t seem to be equipped with any, nor is it promoted as having some… The Minerva DP6303 could be an option but would need its switches tuned. I’ll first try the DEG1, maybe with a 30 cm link only.

One thing I won’t test, as it wouldn’t suit me: the ADT-Link F9G has an equivalent that uses a flat ribbon to the NVMe adapter. That is probably better than OCuLink for signal integrity (no connectors in between), and it might allow a 15 to 20 cm ribbon run with the GPU inside the PC case.

I’m using a mini-ITX motherboard, whose x16 slot is occupied by an x8 LAN card.

You must not have looked carefully ;-]

I’m almost sure it will resolve your signal integrity problems. The DEG1 comes with an OCuLink cable and is known not to work well with other cables, so definitely try the included one first.

I had missed that point about the ADT-F4Q. The docs are unclear or nonexistent; what I took for capacitors could be chips/line drivers. I’ve ordered one to see and test.

I also ended up thinking about pulling out my LAN card (2x 10GE, running fine at full rate bidirectionally) to seat the 5060 Ti there instead. But I never game, and I don’t think I’ll ever need x16 at Gen 5 for the GPU. With ollama, models load at only about 2-3 GB/s… the limit anyway being my SSD at 7 GB/s.
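Rough numbers back that up: usable per-lane throughput after encoding overhead (8b/10b for Gen 1-2, 128b/130b for Gen 3+) is roughly 0.25/0.5/1/2/4 GB/s for Gen 1 through Gen 5, so even an x4 Gen 4 link sits near 8 GB/s, above what the SSD can feed. A quick table:

```shell
# Approximate usable PCIe throughput for an x4 link, per generation.
# Per-lane GB/s after encoding overhead (rounded): Gen1..Gen5.
awk 'BEGIN {
    split("0.25 0.5 0.985 1.969 3.938", rate, " ")
    lanes = 4
    for (g = 1; g <= 5; g++)
        printf "x%d Gen%d: ~%.1f GB/s\n", lanes, g, lanes * rate[g]
}'
```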

I should now get the DEG1 soon.

I got the DEG1; a nice piece of hardware. It came with an OCuLink cable, which I plugged into a basic NVMe adapter (no electronics).

I started the Windows VM, connected to it… enabled the 5060 Ti in there, and then it crashed. I again had to power-cycle the setup.

Following the reboot, the device seems fine, at 16 GT/s, x4:

# lspci -vvvs 03:00.0 | grep Speed
                LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 unlimited
                LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-

Loading the Nvidia open driver, the speed went down to Gen 1…

# modprobe nvidia
# nvidia-smi
Wed Jun 18 18:18:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:03:00.0 Off |                  N/A |
|  0%   33C    P0             24W /  180W |       0MiB /  16311MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
# lspci -vvvs 03:00.0 | grep Speed
                LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 unlimited
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-

I did not force Gen 1 and was expecting to keep 16 GT/s. It seems something went wrong with that dock as well. I again notice errors on the PCIe link:

# nvidia-smi pci -gErrCnt
GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-515f7ee1-338e-94e0-ab03-3c881720a7e2)
    REPLAY_COUNTER:          0
    REPLAY_ROLLOVER_COUNTER: 0
    L0_TO_RECOVERY_COUNTER:  81
    CORRECTABLE_ERRORS:      0
    NAKS_RECEIVED:           0
    RECEIVER_ERROR:          0
    BAD_TLP:                 0
    NAKS_SENT:               0
    BAD_DLLP:                0
    NON_FATAL_ERROR:         0
    FATAL_ERROR:             0
    UNSUPPORTED_REQ:         0
    LCRC_ERROR:              0
    LANE_ERROR:
         lane  0: 0
         lane  1: 0
         lane  2: 0
         lane  3: 0
         lane  4: 0
         lane  5: 0
         lane  6: 0
         lane  7: 0
         lane  8: 0
         lane  9: 0
         lane 10: 0
         lane 11: 0
         lane 12: 0

ollama failed to detect/use the GPU: msg=“error looking up nvidia GPU memory” error=“cuda driver library failed to get device context 719”, then msg=“no compatible GPUs were discovered”. At the PCIe level, the errors have increased:

    L0_TO_RECOVERY_COUNTER:  164

The DEG1 is designed to match certain Minisforum mini PCs, which may have specific electronics inside to drive the OCuLink lines, perhaps chips like those on the ADT-F4Q.

In the next few days, I’ll pull out the LAN card to experiment with the x16 motherboard slot.

Edit… These were my initial quick observations:

The GPU is now in the x16 PCIe 5.0 slot… Yet I still noticed some errors under Linux: retrainings, and the speed downgraded from Gen 5 all the way to Gen 1. ollama worked, using the GPU.

My Windows VM now runs fine as well and doesn’t crash, using the GPU at about 7%. It also runs at Gen 1 speed.

So my retraining/downgrading issue may be unrelated to the eGPU dock.

>nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Thu Jun 19 20:45:59 2025
Driver Version                            : 576.52
CUDA Version                              : 12.9

Attached GPUs                             : 1
GPU 00000000:05:00.0
    Product Name                          : NVIDIA GeForce RTX 5060 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Blackwell
    Display Mode                          : Requested functionality has been deprecated
    Display Attached                      : No
    Display Active                        : Disabled
. . .
    PCI
        Bus                               : 0x05
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x2D0410DE
        Bus Id                            : 00000000:05:00.0
        Sub System Id                     : 0x8A111043
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 1
                Device Current            : 1
                Device Max                : 5
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 165999 KB/s
        Rx Throughput                     : 6902 KB/s
. . .
    Utilization
        GPU                               : 7 %
        Memory                            : 5 %
        Encoder                           : 0 %
        Decoder                           : 7 %
        JPEG                              : 0 %
        OFA                               : 0 %
. . .

Edit: and my final observations…

I now have the GPU at Gen 5. The link speed seems to change according to load and power-saving settings.

        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : N/A

I’m connecting to that VM remotely, using RDP. I added nvidiaopenglrdp.exe and then FurMark (for GL tests). The GPU is now running at 100%/180 W, producing 224 FPS. Windows has been stable so far.