NVidia Quadro P5000 in a Razer Core X Chroma eGPU fails to initialize

I’m trying to make my NVidia Quadro P5000 work in a Razer Core X Chroma eGPU, but it just fails to initialize. (The GPU had been working fine for years in a regular desktop machine.) What could be wrong? I’m out of ideas.

On a Desktop

Motherboard: ASRock x570 Creator
CPU: AMD Ryzen 3950X
System: ArchLinux with kernel 5.9.11
GPU in the on-board PCIe: AMD Radeon Pro W5700
Related kernel flags: pci=realloc,assign-busses,hpbussize=0x33 radeon.auxch=1 mem_encrypt=on

Without the pci=... flag, Thunderbolt devices don’t work. With the flag they appear to work just fine (tested e.g. with a Lenovo Thunderbolt 3 dock).

Here’s a dmesg output when I plug in the eGPU. The most relevant part might be:

Nov 30 16:44:33 charon kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Nov 30 16:44:33 charon kernel: nvidia 0000:3d:00.0: enabling device (0000 -> 0003)
Nov 30 16:44:33 charon kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:3d:00.0)
Nov 30 16:44:33 charon kernel: NVRM: The system BIOS may have misconfigured your GPU.
Nov 30 16:44:33 charon kernel: nvidia: probe of 0000:3d:00.0 failed with error -1
Nov 30 16:44:33 charon kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 30 16:44:33 charon kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 30 16:44:33 charon kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
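
The kernel's own view of the card's BARs can be checked separately from the driver's complaint by reading the device's sysfs resource file. Below is a minimal Python sketch. It assumes the device address 0000:3d:00.0 from the log above and the usual layout of that file (one region per line, three hex numbers: start, end, flags). The helper names are hypothetical, and a zero start address is only a rough hint that a region is unused or was never assigned.

```python
from pathlib import Path


def parse_resources(text):
    """Parse the contents of a sysfs PCI `resource` file.

    Each line is expected to hold three hex numbers: start, end, flags.
    Returns a list of (index, start, end, flags) tuples.
    """
    rows = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that does not match the assumed layout
        start, end, flags = (int(field, 16) for field in fields)
        rows.append((index, start, end, flags))
    return rows


def assigned_regions(rows):
    """Indices of regions that received a non-zero start address."""
    return [index for index, start, _end, _flags in rows if start != 0]


def read_device(address):
    """Read and parse the resource file of one PCI device, e.g. '0000:3d:00.0'."""
    path = Path("/sys/bus/pci/devices") / address / "resource"
    return parse_resources(path.read_text())


# Usage on the affected machine (address taken from the dmesg output above):
#     rows = read_device("0000:3d:00.0")
#     print(assigned_regions(rows))   # index 0 missing here would match "BAR0 is 0M @ 0x0"
```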

I’ve searched for the error messages. Starting from the NVidia forums (1) (2), I’ve double-checked that

  • Above 4G Decoding is enabled in my UEFI setup and
  • I do have at least one 64-bit window (>8 hex digits) earlier in dmesg:
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000dffff window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0xb0000000-0xefffffff window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x2050000000-0x7fffffffff window]
    Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [bus 00-ff]
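
The "more than 8 hex digits" check can also be done mechanically. This is a small Python sketch that scans dmesg text for root bus memory windows reaching above 4 GiB. The regular expression is written against the line format shown above, and the function name is hypothetical.

```python
import re

# Matches e.g. "root bus resource [mem 0x2050000000-0x7fffffffff window]"
WINDOW = re.compile(
    r"root bus resource \[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+) window\]"
)
FOUR_GIB = 1 << 32


def windows_above_4g(dmesg_text):
    """Return (start, end) pairs of memory windows that extend past 4 GiB."""
    result = []
    for match in WINDOW.finditer(dmesg_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if end >= FOUR_GIB:
            result.append((start, end))
    return result


# Two of the lines quoted above, used as sample input.
SAMPLE = """\
kernel: pci_bus 0000:00: root bus resource [mem 0xb0000000-0xefffffff window]
kernel: pci_bus 0000:00: root bus resource [mem 0x2050000000-0x7fffffffff window]
"""

for start, end in windows_above_4g(SAMPLE):
    size_gib = (end - start + 1) / 2**30
    print(f"64-bit window {start:#x}-{end:#x} ({size_gib:.2f} GiB)")
```

On a live system the input could come from the output of `dmesg` or `journalctl -k` instead of the sample string.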
    

The eGPU appears normally in boltctl list (authorized etc.), and the NVidia Quadro P5000 appears in lspci. However, nothing else works: neither the NVidia card itself nor the USB hub(s) (with a built-in ASIX ethernet adapter) in the eGPU.

Some threads recommended /sys/bus/pci/devices gymnastics, such as this post. Not only does that not work for me, it also triggers this crash from 2015, which still affects my machine today: the system freezes and panic-reboots when I try it. So I haven’t experimented any further.

On a Laptop

Machine: Lenovo X1 Carbon v7
CPU: Intel Core i7-8665U
System: Debian with kernel 5.9.8
Related kernel flags: pci=noats

Importantly, the laptop does not have the NVidia driver installed — some forum posts explicitly asked for dmesg without the NVidia driver. So here it is — a dmesg output from the laptop without NVidia drivers.

Again, boltctl list looks normal (authorized etc.). The NVidia Quadro P5000 appears in lspci. The difference from the desktop case above is that at least something works — the USB buses and the ASIX network card (ax88179_178a). But the NVidia card doesn’t work — “no space for” occurs a number of times in dmesg.

For the record, I have just re-posted this also on two Razer forums (1), (2) for more visibility and also because I’ve found some (heavily ambiguous) information on the web that casts doubt on the compatibility of my Core X Chroma with NVidia Quadro P5000.

If any of the threads produces an answer, I’ll update the other threads.

Update: I’ve managed to make it work on the laptop. Despite the issues reported in dmesg, enabling the NVidia driver made it work. I have the module loaded, nvidia-smi shows something reasonable and I can offload applications using __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only __GLX_VENDOR_LIBRARY_NAME=nvidia <command>, as described here.
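
For repeated use, the three environment variables can be wrapped in a small launcher. This is only a Python sketch of the same command line as above; the function names are hypothetical, and it does nothing beyond setting the variables before starting the program.

```python
import os
import subprocess

# The PRIME render offload variables from the command line above.
OFFLOAD_ENV = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__VK_LAYER_NV_optimus": "NVIDIA_only",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
}


def offload_env(base=None):
    """Return a copy of the environment with the offload variables added."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLOAD_ENV)
    return env


def run_offloaded(command):
    """Run `command` (a list of arguments) with rendering offloaded to the NVidia GPU."""
    return subprocess.run(command, env=offload_env(), check=False)


# Usage:
#     run_offloaded(["glxgears"])
```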

Remaining problems:

  • It works fine for glxgears (uh oh) and (e.g.) for stellarium, but google-chrome yields a 100% black window.
  • It’s terribly slow; acceleration on the Intel GPU is way faster. I think this is because I have a 4k laptop display, a 5k Thunderbolt monitor connected to a TB port of the laptop (daisy-chained through a dock, actually) and the eGPU connected to the other TB port on the laptop. So there may be (?) a shortage of bandwidth somewhere in the setup (laptop → eGPU → laptop → dock → monitor). (I can’t connect the monitor directly to the eGPU, because the P5000 doesn’t have a Thunderbolt output.)
  • The desktop: It just doesn’t work. But I think this narrows the possible causes down to the ASRock x570 Creator motherboard. In hindsight I should have asked ASRock rather than NVidia and Razer.

Résumé: I’ll ask ASRock about this. Perhaps this is an inherent limitation of the CPU/chipset that wouldn’t allow an additional GPU when there is already a GPU in a slot (AMD Radeon Pro W5700), Thunderbolt is enabled, both M.2 slots have SSDs in them, etc.

Alright, I’ve figured it out, based on this post. The magic is:

pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=128M,hpmmioprefsize=16G

With this on the kernel command line, I can just plug in the eGPU and it works, no problem at all. The nvidia kernel module loads correctly and I’m calculating Folding@Home on the eGPU right now, so it definitely works.

(My machine won’t boot if I add the recommended nocrs to pci=..., because the kernel can’t talk to SATA controllers and drives in that mode and freezes forever while trying to do so. But the eGPU works without nocrs just fine, so I’m not messing with that any further.)
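
After editing the boot loader configuration it is easy to miss one of the comma-separated values, so it can be worth verifying that the running kernel actually received all of them. This is a minimal Python sketch that compares /proc/cmdline against the parameters from the working command line above. The function names are hypothetical, and it only checks that the options are present, not that they had any effect.

```python
# The pci= values from the working command line above.
WANTED_PCI = [
    "assign-busses",
    "hpbussize=0x33",
    "realloc",
    "hpmmiosize=128M",
    "hpmmioprefsize=16G",
]


def pci_options(cmdline):
    """Collect the comma-separated values of every pci= parameter."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(token[len("pci="):].split(","))
    return options


def missing_parameters(cmdline):
    """Return the expected parameters that are absent from `cmdline`."""
    present = set(pci_options(cmdline))
    missing = [opt for opt in WANTED_PCI if opt not in present]
    if "pcie_ports=native" not in cmdline.split():
        missing.append("pcie_ports=native")
    return missing


# Usage on the running system:
#     with open("/proc/cmdline") as f:
#         print(missing_parameters(f.read()) or "all parameters present")
```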