eGPU is not recognized by nvidia-smi in an NVIDIA Optimus setup

What do I want to do?

I have an Acer Swift 3 laptop with a Thunderbolt 3 port. I want to use an external GPU for ML computing with PyTorch, but so far I have not been able to get it working on this device.

The setup below is a fresh installation, to make it as easy as possible to follow. Wayland is currently still enabled; in a previous attempt I also tried with it disabled, with the same result as below.

I really don’t know how to get the eGPU running. Please help me :)

technical data

  • Ubuntu 22.04.2 LTS
~$ uname -r
5.19.0-35-generic
~$ lspci | grep -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Iris Plus Graphics G7 (rev 07)
03:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
2b:00.0 3D controller: NVIDIA Corporation GP108BM [GeForce MX250] (rev a1)
  • installed driver package: nvidia-driver-525, installed via sudo apt-get install
    • secure boot is available, but key was enrolled after installation
  • Acer Swift 3 SF314-57G
  • Intel Iris Plus Graphics (iGPU)
  • Nvidia Geforce MX250 (dGPU)
  • Nvidia Geforce RTX 3090 (eGPU)

Problem

RmInitAdapter failed! (0x26:0x56:1474)
https://forums.developer.nvidia.com/t/k-ubuntu-22-10-not-booting-kernel-oops-for-driver-450-with-egpu/235008/3?u=generix

  1. The BIOS version is already up to date
  2. Using nvidia-driver-525-open results in

    I also set the kernel parameter nvidia.NVreg_OpenRmEnableUnsupportedGpus=1
  3. I haven’t yet tried the “driver 470.57-470.82” version you suggested in another thread

The 3090 doesn’t get enough resources assigned so the driver fails.
The second message is about your integrated MX250, which doesn’t work with the open driver at all.
To get the 3090 working, please try kernel parameter
pci=realloc
and possibly remove/rescan it on the pcie bus, e.g.

echo 1 > /sys/bus/pci/devices/0000:03:00.0/remove
echo 1 > /sys/bus/pci/rescan
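
For anyone following along, the remove/rescan step above can be wrapped in a small helper. This is only a minimal sketch: `pci_remove_rescan` and the `SYSFS_PCI` variable are names made up for illustration (they are not part of the kernel or the NVIDIA driver), and the device address is the one shown by lspci in this thread.

```shell
#!/bin/sh
# Sketch: remove one PCI device and trigger a bus rescan through sysfs.
# SYSFS_PCI is a hypothetical override (not a kernel or driver setting) so the
# function can be dry-run against a scratch directory; it defaults to the real path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

pci_remove_rescan() {
    addr="$1"                                   # e.g. 0000:03:00.0
    remove_node="$SYSFS_PCI/devices/$addr/remove"
    if [ ! -e "$remove_node" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    # Writing 1 detaches the device from the kernel's PCI tree.
    echo 1 > "$remove_node" || return 1
    # Writing 1 asks the kernel to re-enumerate all PCI buses.
    echo 1 > "$SYSFS_PCI/rescan"
}
```

The function has to be called from a root shell (for example after `sudo -i`), because the redirections are performed by the calling shell and not by `echo`. After a successful rescan the device is expected to show up in lspci again, so its presence there does not by itself tell whether the removal worked.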

Thank you for the reply!

I’m now running with the kernel parameters nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 and pci=realloc, but that didn’t make any difference.
After boot I tried the suggested commands to remove/rescan the PCIe bus, but this changed nothing. It also doesn’t really remove the GPU (I can still see it via lspci).

According to https://unix.stackexchange.com/a/727720 I disabled the MX250 GPU at boot, so that it no longer shows up in lspci.
The error still seems to be the same:

Possibly relevant: the dmesg log keeps growing with the same message, so it seems the nvidia driver keeps trying to initialize the GPU for as long as the system is running.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Unfortunately, I’m not allowed to put 2 URLs in one post (and yes, the attachment counts as a URL). Additionally, the upload of the .gz file got stuck, so I had to tar it before uploading.

nvidia-bug-report.log.tar.gz (929.6 KB)

Please uninstall the nvidia driver, blacklist nouveau, reboot and provide a dmesg output
sudo dmesg > dmesg.txt
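
For readers who have not blacklisted nouveau before: on Ubuntu this is commonly done with a modprobe.d file. The following is a sketch under assumptions, not the exact steps used in this thread. The function name `write_nouveau_blacklist` and the file name `blacklist-nouveau.conf` are chosen for illustration; the optional directory argument only exists so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: write a modprobe.d file that keeps the nouveau module from loading.
# The target directory defaults to /etc/modprobe.d; passing another directory
# is a hypothetical convenience for dry runs.
write_nouveau_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    printf '%s\n' \
        'blacklist nouveau' \
        'options nouveau modeset=0' \
        > "$dir/blacklist-nouveau.conf"
}
```

Run it as root, then rebuild the initramfs with `sudo update-initramfs -u` and reboot. Afterwards `lsmod | grep nouveau` should print nothing.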

  • driver uninstalled
  • kernel parameters removed
  • MX250 GPU still disabled
  • nouveau driver blacklisted
  • rebooted and checked

dmesg.txt (92.7 KB)

Please set the kernel parameter pci=realloc but don’t install the nvidia driver yet.
After reboot, please run

sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:00:07.0/remove"
sudo -sh -c "echo 1 > /sys/bus/pci/rescan"

and create a dmesg output.
Then run

sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove"
sudo -sh -c "echo 1 > /sys/bus/pci/rescan"

and create a second dmesg output. Please attach both outputs.

After adding kernel parameter pci=realloc the reboot fails:

Very interesting, pci=realloc makes your nvme controller vanish. Please remove pci=realloc in the grub menu to boot normally and check for a bios update.

Unfortunately my bios is already up-to-date with version 1.18

Then please create the two dmesg outputs without realloc. Though this likely won’t help.
The pci bridge 00:07.0 has enough space but the downstream thunderbolt controller 01:00.0 only gets 1MB to offer and the nvidia gpu wants 16MB BAR0 space.
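
To see such sizes without reading through dmesg, the sysfs `resource` file of a device can be inspected. Each line of that file holds a start address, an end address and flags in hex, so the size of a region is end - start + 1. The sketch below assumes that format; `pci_region_sizes` is a name made up for illustration.

```shell
#!/bin/sh
# Sketch: print the size of each populated region listed in a sysfs
# "resource" file, e.g. /sys/bus/pci/devices/0000:03:00.0/resource.
pci_region_sizes() {
    resource_file="$1"
    index=0
    while read -r start end flags; do
        # Unused entries are reported as all zeros; skip them.
        if [ "$((end))" -ne 0 ]; then
            size=$(( end - start + 1 ))
            echo "region $index: $(( size / 1024 )) KiB"
        fi
        index=$(( index + 1 ))
    done < "$resource_file"
}
```

For example, `pci_region_sizes /sys/bus/pci/devices/0000:03:00.0/resource` would list the regions of the eGPU. `sudo lspci -vv -s 03:00.0` reports similar information in its "Region" lines.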

For others following along, I’ve corrected the rescan command to:

sudo sh -c "echo 1 > /sys/bus/pci/rescan"

dmesg1.txt (102.7 KB)

And because I’m still not allowed to put 2 links in 1 post:
dmesg2.txt (109.0 KB)

Ok, for a last test, please reboot and only run the last remove/rescan and create a new dmesg output.

sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove"
sudo sh -c "echo 1 > /sys/bus/pci/rescan"

I was a little pessimistic yesterday, so I tried installing Windows alongside Ubuntu. The subsequent Windows updates almost crashed my system, but now everything is running. Interestingly, the Windows updates came with a BIOS update; I really don’t know why. So my BIOS version is now 1.21 instead of 1.18.

I decided to retry every command you suggested:

  • pci=realloc as kernel parameter → still makes my NVMe controller vanish → removed again
sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:00:07.0/remove"
sudo sh -c "echo 1 > /sys/bus/pci/rescan"

dmesg1.txt (101.3 KB)

sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove"
sudo sh -c "echo 1 > /sys/bus/pci/rescan"

dmesg2.txt (107.6 KB)

reboot

sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove"
sudo sh -c "echo 1 > /sys/bus/pci/rescan"

dmesg3.txt (110.2 KB)

The Thunderbolt controller wants to work but the upstream PCI bridge won’t let it without realloc…
Please check if you can boot with
pci=realloc,nocrs
If that also doesn’t work, try
pci=nocrs
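
When switching between several parameter combinations like these, it helps to confirm what the running kernel was actually booted with. A minimal sketch, where `kernel_param_value` is a hypothetical helper name and the optional second argument is only there so the function can be tried on a sample string instead of /proc/cmdline:

```shell
#!/bin/sh
# Sketch: print the value of a key=value kernel parameter.
# Reads /proc/cmdline unless a command line string is passed as second argument.
kernel_param_value() {
    key="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for word in $cmdline; do
        case "$word" in
            # Print everything after the first "=" of the first match.
            "$key"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}
```

`kernel_param_value pci` prints, for example, `realloc,nocrs` when that option is active, and returns a non-zero status when no pci= parameter is set. On Ubuntu, parameters are usually made persistent by adding them to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running `sudo update-grub`.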

With the kernel parameter pci=realloc,nocrs Ubuntu is finally able to boot! I’ve reinstalled nvidia-driver-525-open and rebooted, but nvidia-smi now prints “No devices were found”. Something is still failing.
dmesg.txt (104.9 KB)

Thank you very much in advance for taking time and helping me!