Driver can't detect eGPU

Hello! I’ve got problems with a Linux laptop + eGPU setup.

Specs:

  • OS: Fedora 38 KDE Plasma
  • eGPU: RTX 3070 Ti
  • Driver version: 535 and 470

I’ve installed the drivers following these recommendations and tried both Wayland and X11, with both the latest driver and the 470 series. The issue is that the driver cannot find the eGPU for some reason, even though the system detects the device:

lspci | grep VGA
52:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Ti] (rev a1)

and modules are loaded into the kernel:

lsmod | grep nvidia
nvidia_drm             94208  0
nvidia_modeset       1556480  1 nvidia_drm
nvidia_uvm           3497984  0
nvidia              62734336  83 nvidia_uvm,nvidia_modeset
video                  77824  3 thinkpad_acpi,i915,nvidia_modeset

For configuration and management I also tried all-ways-egpu for Wayland, and gswitch and egpu-switcher for X11. I’m attaching the output of nvidia-bug-report.sh for both the 535 and 470 versions.
I should also mention that this exact hardware setup was working fine about a year ago, when I first configured it with Ubuntu. A couple of months later, after some update, it broke and has not worked since.

nvidia-bug-report-470.log.gz (111.9 KB)
nvidia-bug-report-535.log.gz (128.7 KB)

So, here are my findings:

  • Setting any of the GRUB options pcie_aspm=off nouveau.modeset=0 nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 didn’t help, either with akmod-nvidia or with akmod-nvidia-open.
  • Installing the latest 535 driver from the run file (sudo ./NVIDIA-Linux-x86_64-535.113.01.run -m=kernel-open) didn’t work.
  • Found an RmInitAdapter failed! error in the logs generated by nvidia-bug-report.sh for both the proprietary and the open versions. I also got compilation errors when trying to build the 470.82.00 and 515.105.01 driver versions (again from run files) against the Fedora kernel (6.5.8-200.fc38.x86_64), so no luck with that.
  • Going to try the currently latest version (545.23.06) from the run file; I see no more options if this fails.
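
The kernel parameters in the first bullet only matter if they actually reach the running kernel, and a one-off edit in the GRUB boot menu does not persist across reboots. A minimal sketch for checking this, assuming a Linux system that exposes /proc/cmdline; has_kernel_param is a hypothetical helper name, not part of any NVIDIA or Fedora tool:

```shell
#!/usr/bin/env bash
# has_kernel_param: hypothetical helper, not part of any NVIDIA or Fedora tool.
# Usage: has_kernel_param PARAM [CMDLINE_FILE]
# CMDLINE_FILE defaults to /proc/cmdline, the command line of the running kernel.
has_kernel_param() {
    local param=$1 file=${2:-/proc/cmdline}
    # Put each whitespace-separated token on its own line, then require an
    # exact whole-token match, so "pcie_aspm=off" does not match a longer token.
    tr -s ' \t' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, `has_kernel_param nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 && echo active` prints "active" only when the parameter was passed at boot. On Fedora, one way to make a parameter persistent is `sudo grubby --update-kernel=ALL --args="nvidia.NVreg_OpenRmEnableUnsupportedGpus=1"`, followed by a reboot.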

nvidia-bug-report-535-open.log.gz (306.4 KB)

An update:

Installing the latest beta driver from the run file also didn’t work: sudo ./NVIDIA-Linux-x86_64-545.23.06.run -m=kernel-open. So I currently see virtually no way to make an external GPU work with the latest kernels.

You need to install the -open version of the nvidia driver. The log you provided with -open was using the normal driver.

That’s weird: the log above is from either the akmod-nvidia-open package or the NVIDIA-Linux-x86_64-535.113.01.run run file (can’t recall exactly). You can also see that the /tmp/selfgz5811/NVIDIA-Linux-x86_64-535.113.01/kernel-open path was used during compilation and that [ 146.427273] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 535.113.01 Release Build (dvs-builder@U16-I2-C03-37-4) Tue Sep 12 19:48:46 UTC 2023 was logged.

Which log file contained this? I used nvidia-bug-report-535-open.log.gz and read

nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.113.01  Tue Sep 12 19:45:42 UTC 2023

so it was using the closed driver.

Yeah, this one: nvidia-bug-report-535-open.log.gz. You can search for NVIDIA UNIX Open Kernel in this file. I guess it just holds a collection of all my attempts from the past few days :)

That’s from an older runfile installer log, and the -open driver seemed to work fine
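
Since one bug report can contain banners from several attempts, it helps to list every nvidia-modeset banner with its position in the file. A minimal sketch, assuming the report has been decompressed first (for example with `zcat nvidia-bug-report.log.gz > report.log`) and that the two banner strings quoted in this thread are the only ones that matter; list_driver_banners is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_driver_banners: hypothetical helper; prints "<line number> <flavour>"
# for every nvidia-modeset load banner found in the given log file.
# The two patterns are the banner strings quoted in this thread.
list_driver_banners() {
    grep -n 'nvidia-modeset: Loading NVIDIA' "$1" | while IFS= read -r line; do
        case $line in
            *'UNIX Open Kernel Mode Setting Driver'*)          flavour=open ;;
            *'Kernel Mode Setting Driver for UNIX platforms'*) flavour=proprietary ;;
            *)                                                 flavour=unknown ;;
        esac
        # grep -n prefixes each match with "N:", so strip everything after the first colon.
        printf '%s %s\n' "${line%%:*}" "$flavour"
    done
}
```

Running `list_driver_banners report.log` shows at a glance whether the open and the proprietary module were both loaded at different points, and where in the file each attempt sits.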

Okay, I’ll try the akmod-nvidia-open package again and generate the report once more. In practice I got the same result with the open driver: the external monitor got no signal and the screen stayed black.

This might be just a timing issue, i.e. the driver loads too late and the Xserver/Wayland is already up. Can’t really tell without a proper log, though.

Okay, here is the new attempt; the interesting part starts around line 7k. I booted with the nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 parameter set and the open driver version was loaded, but I got RmInitAdapter failed!. Also, with the OpenRmEnableUnsupportedGpus parameter set, lsmod | grep -i nvidia returned nothing, so the modules didn’t load. Without this parameter the modules are loaded.

nvidia-bug-report-535-open.log.gz (150.9 KB)

You forgot to properly set `nvidia.NVreg_OpenRmEnableUnsupportedGpus=1`.
At least it’s not being used.

Hm, yeah, that’s weird, I set it at boot time in the GRUB menu. I’ll set it persistently now and rebuild the GRUB config.
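
To confirm that the loaded module really picked the option up, the driver’s own view can be read back. A minimal sketch, assuming the NVIDIA module is loaded and exposes /proc/driver/nvidia/params as "Name: value" lines, with the option listed without its NVreg_ prefix (an assumption about the file layout, worth checking on the machine in question); nvreg_value is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# nvreg_value: hypothetical helper; prints the value of one driver option.
# Usage: nvreg_value KEY [PARAMS_FILE]
# PARAMS_FILE defaults to /proc/driver/nvidia/params and is assumed to
# contain lines of the form "Key: value".
nvreg_value() {
    local key=$1 file=${2:-/proc/driver/nvidia/params}
    # Split each line on ": ", print the value for an exact key match, and
    # return a non-zero status when the key is not present at all.
    awk -F': *' -v k="$key" '$1 == k { print $2; found = 1 } END { exit !found }' "$file"
}
```

With the option applied, `nvreg_value OpenRmEnableUnsupportedGpus` would be expected to print 1; a 0 suggests the kernel command line change never reached the module.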

Yes, now I can see something new, but I don’t know what it means. I think I’d need to create a new issue on GitHub…

Oct 31 17:53:41 fedora kernel: NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
Oct 31 17:53:44 fedora kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:52:00.0 on minor 1
Oct 31 17:53:44 fedora systemd[1]: nvidia-fallback.service - Fallback to nouveau as nvidia did not load was skipped because of an unmet condition check (ConditionPathExists=!/sys/module/nvidia).
Oct 31 17:54:06 fedora kernel: NVRM unixCallVideoBIOS: int10h(4f02, 0000) vesa call failed! (4f02, 0000)
Oct 31 17:54:06 fedora kernel: NVRM nvCheckOkFailedNoLog: Check failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from pRmApi->Control(pRmApi, nv->rmapi.hClient, nv->rmapi.hSubDevice, NV2080_CTRL_CMD_INTERNAL_DISPLAY_POST_RESTORE, &restoreParams, sizeof(restoreParams)) @ unix_console.c:197

Again, around line 7k.

nvidia-bug-report-535-open.log.gz (313.9 KB)

Not looking good, something seems to be really wrong with the Thunderbolt support in the system BIOS. Please check for a system BIOS update.

Alright, I update it periodically, but I’ll check whether there is anything new. Again, it was working well a year ago, and after some system update (can’t recall if a BIOS update was also involved) it broke…
Thanks!

Just for info, did you try to restart Xorg to check whether the error messages can be ignored or not?

Yep, I tried restarting it, but it doesn’t help. I checked everything and updated whatever was outdated, but the BIOS is already at the latest version (N3AET77W (1.42) 2023-09-21). Also, the eGPU works just fine on Windows.

Yes, the error 0x26,0x56
RmInitAdapter failed! (0x26:0x56:1482)
is specific to the proprietary Linux driver. It seems to expect something special from the system BIOS that is not available for eGPUs over Thunderbolt.
Usually the open driver works without an issue in that case. Maybe open an issue on GitHub with the -open driver to shed some light on this.
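
When comparing reports from different driver builds, it can be useful to pull out just the RmInitAdapter codes. A minimal sketch, assuming a decompressed log and the "RmInitAdapter failed! (0x..:0x..:N)" format quoted above; rminit_codes is a hypothetical helper name, and it only extracts the codes without interpreting them:

```shell
#!/usr/bin/env bash
# rminit_codes: hypothetical helper; prints each distinct RmInitAdapter
# failure code triple found in the given log file, one per line.
rminit_codes() {
    # Keep only the matching fragment of each line, strip everything but the
    # text inside the parentheses, then collapse duplicates.
    grep -oE 'RmInitAdapter failed! \(0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:[0-9]+\)' "$1" \
        | sed -E 's/.*\((.*)\)/\1/' \
        | sort -u
}
```

If the proprietary and the open logs yield different code sets, that difference is worth quoting in the GitHub issue alongside the full reports.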

Got it, yep, I’ve created one here: eGPU kernel modules failure - Chipset Setup Function Error! · Issue #568 · NVIDIA/open-gpu-kernel-modules · GitHub