I am working on a Lenovo T490s with a PNY Quadro RTX 4000 connected to it in a Razer Core as an eGPU (driver 470). The machine has both a Windows partition and an Ubuntu partition. Under Ubuntu the eGPU just stopped working for no apparent reason. I have since reinstalled Ubuntu several times and tried many of the instructions for connecting an eGPU, but none of them worked. Under Windows it works without problems. One thing I noticed is that lspci lists the card only as an Nvidia device, without the model identification. When I run nvidia-smi, I always get the error message that the driver could not be loaded.
For my last attempt I installed the 470 driver and used egpu-switcher. This usually just results in a login loop; this time I couldn’t even get to the console. So the bug report was created with only the driver installed, without egpu-switcher. I have tried the common fixes such as blacklisting nouveau and removing the files that blacklist nvidia, as well as allow eGPU = True, wayland = off, prime-nvidia, etc. I have tried everything I could find, but can’t locate the error. I am happy to reinstall everything and go through it step by step again if that helps to determine the cause.
Thanks a lot for help.
Thanks for your fast help. I tried both parameters with the same result: nvidia-smi now shows “No device found” instead of the message that the driver couldn’t be loaded, and lspci shows the same information as before.
The issue is that the BARs can’t be assigned to the nvidia gpu because the upstream pci bridge of the thunderbolt controller doesn’t have a large enough memory window.
lspci of nvidia gpu:
Memory at b1000000 (32-bit, non-prefetchable) [size=16M]
Memory at <unassigned> (64-bit, prefetchable)
Memory at <unassigned> (64-bit, prefetchable)
[ 0.593861] pci 0000:0a:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[ 0.593862] pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[ 0.593864] pci 0000:0a:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[ 0.593866] pci 0000:0a:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
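To put the sizes in those kernel messages into more familiar units, here is a minimal Python sketch that picks out the failed BAR assignments and converts the hex sizes to MiB. The log text is copied from the lines above; the function name failed_bars and the regular expression are only illustrative assumptions, not part of any NVIDIA or kernel tool.

```python
import re

# Kernel log lines copied verbatim from this post.
LOG = """\
[ 0.593861] pci 0000:0a:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[ 0.593862] pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[ 0.593864] pci 0000:0a:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[ 0.593866] pci 0000:0a:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
"""

# Hypothetical pattern: matches only the "failed to assign" lines.
PATTERN = re.compile(
    r"pci (\S+): BAR (\d+): failed to assign \[mem size (0x[0-9a-fA-F]+)"
)


def failed_bars(log):
    """Return (device, bar_index, size_in_MiB) for each failed assignment."""
    result = []
    for match in PATTERN.finditer(log):
        device = match.group(1)
        bar = int(match.group(2))
        size_bytes = int(match.group(3), 16)
        result.append((device, bar, size_bytes // (1024 * 1024)))
    return result


for device, bar, mib in failed_bars(LOG):
    print(f"{device} BAR {bar}: {mib} MiB could not be assigned")
```

So the gpu needs a 256 MiB and a 32 MiB prefetchable region in addition to the 16 MiB region that was assigned.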
The odd thing is that the bridge upstream of 09:01.0 has a large enough window but doesn’t propagate it downstream:
A BIOS update could really fit as the cause, as we have had a few updates to our university computers lately. Unfortunately, I can’t say how well it ran under Windows before, because I didn’t run the benchmark until after the updates. It could be that the performance is significantly worse now. At the moment I can only see the difference between the benchmark with and without the eGPU.
Unfortunately, I don’t know whether it is possible to undo the BIOS updates. Is there another way to solve the problem?
I tried to find a solution at Lenovo. Before resetting the BIOS I turned off the “Thunderbolt BIOS Assist Mode”. This resulted in the error message “no Device” or “no driver” for almost all kernel parameter combinations (pci=realloc, pci=realloc=off and/or pci=nocrs). The exception was pci=realloc=off on its own, where I got the error message “Failed to initialize NVML: Unknown Error”. The bug report was created in this state. When the GPU was still working I got this error every now and then; disconnecting and reconnecting fixed it. Unfortunately that does not work this time. nvidia-bug-report.log.gz (264.3 KB)
So after trying everything, I decided to downgrade the BIOS. After the downgrade the graphics card is displayed by nvidia-smi again. I now have a login loop again, but unplugging the graphics card, logging in and plugging it back in seems to work. If there is a solution for the login loop, that would be very interesting, but I am happy to be able to work with the GPU again. I have attached a final bug report for review. nvidia-bug-report.log.gz (456.4 KB)
That’s good news. The login loop comes from the fact that you set your nvidia gpu as the primary gpu (prime-select nvidia) but egpus are disabled by default. There are two possibilities: you can either enable egpus for graphics by creating /etc/X11/xorg.conf.d/11-nvidia-egpu.conf
Comparing the resource allocation, lenovo really borked it. With the old bios, bridge 8 has a mem window of 544MB and bridge 9 uses a 296MB window, which fits. With the new bios, bridge 8 was reduced to a 304MB window while bridge 9 now wants a 384MB window, which doesn’t fit.
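As a rough illustration of that comparison, the following Python sketch checks whether the window bridge 9 asks for fits inside the window of bridge 8, using only the four sizes quoted above. It is deliberately simplified: the names LAYOUTS and window_fits are hypothetical, and real allocation also depends on alignment and on whatever else sits behind the bridge.

```python
# Window sizes in MB as quoted in this post: (bridge 8, bridge 9).
LAYOUTS = {
    "old bios": (544, 296),
    "new bios": (304, 384),
}


def window_fits(parent_mb, child_mb):
    """Simplified check: the child window must not exceed the parent window."""
    return child_mb <= parent_mb


for name, (bridge8_mb, bridge9_mb) in LAYOUTS.items():
    if window_fits(bridge8_mb, bridge9_mb):
        spare = bridge8_mb - bridge9_mb
        print(f"{name}: fits, {spare}MB to spare")
    else:
        missing = bridge9_mb - bridge8_mb
        print(f"{name}: does not fit, {missing}MB short")
```

With these numbers the old bios leaves 248MB to spare, while the new bios is 80MB short.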