eGPU is not detected by nvidia-smi (hotplug)

Hello!
I am having problems setting up an eGPU on Ubuntu 24.04.
My goal is to enable hot-plugging of the eGPU, but no matter what I do, it always fails. At first I got a PCI error indicating that there were no free addresses. After tuning the configuration, the eGPU is recognized by the OS and visible in lspci, but nvidia-smi still fails to detect the card. Since I do not know how to proceed, I am asking for help here :)

The things I have done so far:

  1. Enabled the ‘AllowExternalGpus’ option.
  2. Tweaked kernel parameters to: quiet splash pci=realloc,assign-busses,hpbussize=0x10,hpmmiosize=32M,hpmmioprefsize=256M
  3. Played with BIOS parameters (disabled security and Secure Boot).
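
To double-check which sub-options of pci= are actually in effect, the parameter string can be split apart. This is only a minimal sketch: the helper name parse_pci_option is made up for illustration, and the string below is copied from step 2. On a running system the same string could be read from /proc/cmdline instead.

```python
def parse_pci_option(cmdline: str) -> dict:
    """Split the pci=... kernel parameter into its comma-separated sub-options.

    Hypothetical helper for illustration; flags without a value map to True.
    """
    for token in cmdline.split():
        if token.startswith("pci="):
            options = {}
            for sub in token[len("pci="):].split(","):
                key, _, value = sub.partition("=")
                options[key] = value if value else True
            return options
    return {}


# Kernel command line as quoted in step 2 above.
cmdline = ("quiet splash pci=realloc,assign-busses,hpbussize=0x10,"
           "hpmmiosize=32M,hpmmioprefsize=256M")
options = parse_pci_option(cmdline)

for name, value in options.items():
    print(f"{name}: {value}")
```

If a sub-option is missing here after a reboot, the bootloader configuration was probably not regenerated.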

Currently, whenever I plug in the eGPU, I see that it is sometimes correctly recognized and addressed, and visible in lspci:

sashamikoff@sashamikoff-ThinkPad-T480s:~$ lspci | grep -i nvi
01:00.0 3D controller: NVIDIA Corporation GP108M [GeForce MX150] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2070] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)
0a:00.2 USB controller: NVIDIA Corporation TU106 USB 3.1 Host Controller (rev a1)
0a:00.3 Serial bus controller: NVIDIA Corporation TU106 USB Type-C UCSI Controller (rev a1)

but it is never recognized by nvidia-smi.
If I restart the machine with the eGPU plugged in, it is visible to nvidia-smi and I can use it.
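
A rough way to state the mismatch precisely is to compare the GPU functions that lspci lists against the bus IDs the driver reports. The sketch below works on pasted text only. The lspci lines are the ones above. The nvidia-smi line is an assumed example of what `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` might print when only the internal MX150 is initialized; it is not taken from my logs.

```python
# lspci output as pasted above (GPU functions only).
LSPCI = """\
01:00.0 3D controller: NVIDIA Corporation GP108M [GeForce MX150] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2070] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)
"""

# ASSUMPTION: illustrative driver-side output, not from the real machine.
NVIDIA_SMI = """\
00000000:01:00.0
"""


def normalize(bus_id: str) -> str:
    """Reduce '00000000:0A:00.0' or '0a:00.0' to the short form '0a:00.0'."""
    return ":".join(bus_id.strip().lower().split(":")[-2:])


def lspci_gpus(text: str) -> list:
    """Return addresses of NVIDIA VGA / 3D controller functions."""
    gpus = []
    for line in text.splitlines():
        addr, _, desc = line.partition(" ")
        is_gpu = "VGA compatible controller" in desc or "3D controller" in desc
        if is_gpu and "NVIDIA" in desc:
            gpus.append(normalize(addr))
    return gpus


seen_by_driver = {normalize(l) for l in NVIDIA_SMI.splitlines() if l.strip()}
missing = [addr for addr in lspci_gpus(LSPCI) if addr not in seen_by_driver]
print("on the PCI bus but not in nvidia-smi:", missing)
```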

What should I do next? What are the other options?
nvidia-bug-report.log.gz (353.1 KB)

What I also see in the logs is:

[ 2646.407763] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x22:0x40:762)
[ 2646.407845] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 1
[ 2646.608683] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x22:0x40:762)
[ 2646.608770] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 1

These messages pop up from time to time.

I am having the same issue. My setup is Ubuntu 22.04 + Razer Core X with RTX 3090.

[  375.608912] pci 0000:0a:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[  375.608915] pci 0000:0a:00.0: BAR 1 [mem size 0x10000000 64bit pref]: failed to assign
[  375.608917] pci 0000:0a:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[  375.608919] pci 0000:0a:00.0: BAR 3 [mem size 0x02000000 64bit pref]: failed to assign
[  375.608922] pci 0000:0a:00.0: BAR 0 [mem 0xc5000000-0xc5ffffff]: assigned
[  375.608933] pci 0000:0a:00.0: ROM [mem 0xc4800000-0xc487ffff pref]: assigned
[  375.608936] pci 0000:0a:00.2: BAR 0 [mem 0xc4880000-0xc48bffff 64bit pref]: assigned
[  375.608969] pci 0000:0a:00.2: BAR 3 [mem 0xc48c0000-0xc48cffff 64bit pref]: assigned
[  375.609002] pci 0000:0a:00.1: BAR 0 [mem 0xc48d0000-0xc48d3fff]: assigned
[  375.609014] pci 0000:0a:00.3: BAR 0 [mem 0xc48d4000-0xc48d4fff]: assigned
[  375.609031] pci 0000:0a:00.0: BAR 5 [io  size 0x0080]: can't assign; no space
[  375.609034] pci 0000:0a:00.0: BAR 5 [io  size 0x0080]: failed to assign
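
To see how much address space is actually missing, the "failed to assign" lines can be summed up. This is a small sketch over an excerpt of the log above; the regular expression is my own guess at the line format based on these lines, not something taken from kernel documentation.

```python
import re

# Excerpt of the kernel log lines quoted above.
DMESG = """\
[  375.608912] pci 0000:0a:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[  375.608915] pci 0000:0a:00.0: BAR 1 [mem size 0x10000000 64bit pref]: failed to assign
[  375.608917] pci 0000:0a:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[  375.608919] pci 0000:0a:00.0: BAR 3 [mem size 0x02000000 64bit pref]: failed to assign
[  375.608922] pci 0000:0a:00.0: BAR 0 [mem 0xc5000000-0xc5ffffff]: assigned
[  375.609031] pci 0000:0a:00.0: BAR 5 [io  size 0x0080]: can't assign; no space
[  375.609034] pci 0000:0a:00.0: BAR 5 [io  size 0x0080]: failed to assign
"""

# Only the final "failed to assign" lines are counted, so each BAR appears once.
FAILED = re.compile(
    r"pci (\S+): BAR (\d+) \[(mem|io)\s+size (0x[0-9a-fA-F]+)[^\]]*\]: failed to assign"
)


def failed_bars(log: str) -> list:
    """Return (device, bar_index, kind, size_in_bytes) for every failed BAR."""
    return [(dev, int(bar), kind, int(size, 16))
            for dev, bar, kind, size in FAILED.findall(log)]


failures = failed_bars(DMESG)
mem_bytes = sum(size for _, _, kind, size in failures if kind == "mem")

for dev, bar, kind, size in failures:
    print(f"{dev} BAR {bar} ({kind}): {size} bytes unassigned")
print(f"prefetchable memory still needed: {mem_bytes // (1024 * 1024)} MiB")
```

In this excerpt the two prefetchable memory BARs (256 MiB and 32 MiB) are the large failures; the I/O BAR is only 128 bytes.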

The laptop’s BIOS only has 32-bit resources enabled, which is not sufficient for a third GPU. You might check whether you can disable the internal MX150 completely to free up address space.

Thanks for the log analysis.
I have two questions:

  1. If the problem is in the BIOS, why does hotplug work in Windows?
  2. Would it be enough to switch off the internal MX150 during Linux startup, or is a BIOS-level switch-off required?

OK, after experimenting a bit and attaching the eGPU directly to the laptop instead of through the docking station, I am able to see the card. There are no longer any errors in the logs.
The card, however, is still absent from nvidia-smi.

nvidia-bug-report.log.gz (170.0 KB)

0000:07:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU106 [GeForce RTX 2070] [10de:1f02] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd TU106 [GeForce RTX 2070] [1458:37d5]
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 18
	IOMMU group: 15
	Region 0: Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at a0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at c5000000 [virtual] [disabled] [size=512K]

BAR 5 (io registers) still can’t be assigned.
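
The lspci excerpt above is consistent with this: the Control line shows I/O-, meaning I/O space decoding is disabled for the device, and no I/O port region is listed. A minimal sketch that pulls this out of the pasted text follows; the regular expressions are assumptions based on the lines shown here, and the "I/O ports" label is the form lspci normally uses for an assigned I/O BAR.

```python
import re

# Lines copied from the lspci -vv style excerpt above.
LSPCI_VV = """\
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Region 0: Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at a0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at c5000000 [virtual] [disabled] [size=512K]
"""

REGION = re.compile(r"Region (\d+): (Memory|I/O ports) at ([0-9a-f]+).*\[size=(\w+)\]")


def control_flags(text: str) -> dict:
    """Map each Control flag name to True ('+') or False ('-')."""
    match = re.search(r"Control: (.*)", text)
    if not match:
        return {}
    return {flag[:-1]: flag.endswith("+") for flag in match.group(1).split()}


def regions(text: str) -> list:
    """Return (index, kind, address, size) for every assigned region."""
    return [(int(i), kind, addr, size) for i, kind, addr, size in REGION.findall(text)]


flags = control_flags(LSPCI_VV)
assigned = regions(LSPCI_VV)
io_regions = [r for r in assigned if r[1] == "I/O ports"]

print("I/O decoding enabled:", flags["I/O"])
print("memory decoding enabled:", flags["Mem"])
print("assigned regions:", assigned)
print("assigned I/O regions:", io_regions)
```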