Ubuntu 18.04, eGPU, PNY Quadro RTX 4000 stopped working, already reinstalled

Hello all,

I am working on a Lenovo T490s with a PNY Quadro RTX 4000 connected as an eGPU in a Razer Core (driver 470). The machine has both a Windows partition and an Ubuntu partition. Under Ubuntu the eGPU just stopped working for no apparent reason. I have since reinstalled Ubuntu several times and tried many of the instructions for connecting an eGPU, but none of them worked. Under Windows it works without problems. One thing I noticed is that lspci lists the card only as an NVIDIA device, without the model name. When I run nvidia-smi, I always get the error message that the driver could not be loaded.
For my last attempt I installed the 470 driver and used egpu-switcher. This usually just results in a login loop; this time I couldn’t even get to the console. So the bug report was created with the driver installed but without egpu-switcher. I have tried the common fixes such as blacklisting nouveau and removing the nvidia blacklist files. Also allow eGPU = True, Wayland off, prime-nvidia, etc. I have tried everything but can’t find the error. I am happy to reinstall everything and go through it step by step again if that helps to determine the error.
Thanks a lot for help.

nvidia-bug-report.log.gz (194.3 KB)

Please try setting kernel parameter
pci=realloc
or
pci=realloc=off
whichever one helps.
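Since several pci= variants get tried in this thread, it can help to confirm which ones actually reached the running kernel before creating a bug report. The following is only a minimal Python sketch, not part of any NVIDIA tooling; the helper names `pci_options` and `running_kernel_pci_options` are made up for illustration. It relies on /proc/cmdline, which holds the command line of the running kernel.

```python
from pathlib import Path
from typing import List


def pci_options(cmdline: str) -> List[str]:
    """Return every option passed via pci=... on a kernel command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # One pci= token may carry several comma-separated options.
            options.extend(token[len("pci="):].split(","))
    return options


def running_kernel_pci_options() -> List[str]:
    """Hypothetical helper: read the options of the currently booted kernel."""
    return pci_options(Path("/proc/cmdline").read_text())
```

Calling `running_kernel_pci_options()` after a reboot shows whether the parameter that was meant to be tested is really the one in effect.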

Thanks for your fast help. I tried both parameters with the same result: nvidia-smi now shows “No device found” instead of the “driver couldn’t be loaded” error, and lspci shows the same information as before.

Please create a new bug-report.log with pci=realloc set.

nvidia-bug-report.log.gz (3.1 MB)
Sorry for my late reply. Here is the new bug report.

That one was with pci=realloc=off but doesn’t matter.
Does setting
pci=nocrs
or
pci=nocrs pci=realloc
work?

Same “device not found”.

Bug-report with pci=nocrs
nvidia-bug-report_Option1.log.gz (535.1 KB)
Thanks in advance

Bug report with pci=nocrs and pci=realloc
nvidia-bug-report_option2.log.gz (526.9 KB)

The issue is that the BARs can’t be assigned to the nvidia gpu because the upstream pci bridge of the thunderbolt controller doesn’t have a large enough memory window.
lspci of nvidia gpu:

	Memory at b1000000 (32-bit, non-prefetchable) [size=16M]
	Memory at <unassigned> (64-bit, prefetchable)
	Memory at <unassigned> (64-bit, prefetchable)

upstream bridge resources:

[ 0.639556] pci 0000:09:01.0: PCI bridge to [bus 0a]
[ 0.639561] pci 0000:09:01.0: bridge window [io 0x3000-0x3fff]
[ 0.639574] pci 0000:09:01.0: bridge window [mem 0xb0100000-0xb20fffff]

The nvidia gpu wants 256MB+32MB

[ 0.593861] pci 0000:0a:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[ 0.593862] pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[ 0.593864] pci 0000:0a:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[ 0.593866] pci 0000:0a:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]

The odd thing is that the bridge upstream of 09:01.0 has a large enough window but doesn’t propagate it downstream:

[ 0.639598] pci 0000:08:00.0: PCI bridge to [bus 09-39]
[ 0.639616] pci 0000:08:00.0: bridge window [mem 0xb0100000-0xc30fffff] <-- 304MB

I suspect this came from a bios update though I’m puzzled why this still works with windows.
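The window sizes quoted above can be recomputed from the dmesg lines. This is a rough Python sketch, not an official tool: `window_size_mb` is a made-up helper, and the comparison deliberately ignores BAR alignment and the split between prefetchable and non-prefetchable windows, so it only illustrates the order of magnitude.

```python
import re

MB = 1024 * 1024

# Matches the "[mem 0xSTART-0xEND]" part of a dmesg bridge-window line.
WINDOW_RE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]")


def window_size_mb(line: str) -> int:
    """Size in MB of the memory window reported in one dmesg line."""
    match = WINDOW_RE.search(line)
    if match is None:
        raise ValueError("no memory window found in: " + line)
    start, end = (int(group, 16) for group in match.groups())
    return (end - start + 1) // MB  # the range is inclusive


bridge_09 = window_size_mb(
    "pci 0000:09:01.0: bridge window [mem 0xb0100000-0xb20fffff]")
bridge_08 = window_size_mb(
    "pci 0000:08:00.0: bridge window [mem 0xb0100000-0xc30fffff]")

# The two BARs that failed to assign (BAR 1 and BAR 3 in the log above).
wanted = (0x10000000 + 0x02000000) // MB

print("09:01.0 window:", bridge_09, "MB")
print("08:00.0 window:", bridge_08, "MB")
print("GPU wants:", wanted, "MB")
print("fits behind 09:01.0:", wanted <= bridge_09)
```

With the values from this bug report, the window of 09:01.0 is far smaller than what the two BARs ask for, while the window of 08:00.0 would be big enough.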

A BIOS update could really be the cause, as our university computers have received a few updates lately. Unfortunately, I can’t say how well the eGPU ran under Windows before, because I only ran the benchmark after the updates. It could be that the performance is significantly worse now. At the moment I can only see the difference between the benchmark with and without the eGPU.

Unfortunately, I don’t know if it is possible to undo the bios updates. Is there another way to solve the problem?

Thanks in advance.

Please check if upgrading the kernel using the liquorix ppa fixes it, otherwise you’re left to contact lenovo about this, I guess.

nvidia-bug-report.log.gz (182.7 KB)

OK, I upgraded the kernel, but to me it looks like it is still the same problem. Am I right?

Yes, no change in resource allocations.

Maybe also try resetting bios to defaults.

I tried to find a solution on Lenovo’s side. Before resetting the BIOS, I turned off “Thunderbolt BIOS Assist Mode”. With that setting I got the error message “no Device” or “no driver” for almost all kernel parameter combinations (pci=realloc, pci=realloc=off and/or pci=nocrs). The exception was pci=realloc=off on its own, where I got the error message “Failed to initialize NVML: Unknown Error”. The bug report was created in this state. When the GPU was still working I used to get this error every now and then; disconnecting and reconnecting the GPU fixed it. Unfortunately, that does not work this time.
nvidia-bug-report.log.gz (264.3 KB)

No change. You could try the kernel parameters
pci=realloc,hpmmioprefsize=300M,pcie_scan_all

So after trying everything, I decided to downgrade the BIOS. After the downgrade, the graphics card shows up in nvidia-smi again. I now have a login loop again, but unplugging the graphics card, logging in and plugging it back in seems to work. If there is a solution for the login loop, that would be very interesting, but I am happy that I can work with the GPU again. I have attached a final bug report for review.
nvidia-bug-report.log.gz (456.4 KB)

Thanks for all the help.

That’s good news. The login loop comes from the fact that you set your nvidia gpu as primary gpu (prime-select nvidia) but egpus are disabled by default. There are two possibilities: you can either enable egpus for graphics by creating /etc/X11/xorg.conf.d/11-nvidia-egpu.conf

Section "OutputClass"
    Identifier "nvidia-egpu"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowExternalGpus" "True"
EndSection

or you can disable the nvidia gpu for use with the Xserver by creating /etc/X11/xorg.conf

Section "Device"
  Identifier "iGPU"
  Driver "modesetting"
  BusID "PCI:0:2:0"
EndSection

so that only the onboard Intel igpu is ever used.
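One detail that is easy to get wrong in the BusID line: lspci prints bus, device and function in hexadecimal, while the Xorg BusID is written in decimal. For 00:02.0 both look the same, but a device at 0a:00.0 would become PCI:10:0:0. A small Python sketch of the conversion, with `lspci_to_xorg_busid` being a made-up helper name:

```python
def lspci_to_xorg_busid(address: str) -> str:
    """Convert an lspci address such as "0a:00.0" to Xorg's "PCI:10:0:0"."""
    # Keep only bus and device.function; an optional leading PCI domain
    # such as "0000:" is dropped in this simplified sketch.
    parts = address.strip().split(":")
    bus_hex, dev_func = parts[-2], parts[-1]
    dev_hex, func_hex = dev_func.split(".")
    return "PCI:{}:{}:{}".format(
        int(bus_hex, 16), int(dev_hex, 16), int(func_hex, 16))


print(lspci_to_xorg_busid("00:02.0"))       # the Intel igpu from the config above
print(lspci_to_xorg_busid("0000:0a:00.0"))  # the address of the nvidia egpu in the logs
```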

Comparing the resource allocation, lenovo really borked it. With the old bios, bridge 8 has a mem window of 544MB, bridge 9 uses a 296MB window, which fits. With the new bios, bridge 8 was reduced to a 304MB window but bridge 9 increased to wanting a 384MB window, which didn’t fit.
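The comparison boils down to whether the window bridge 9 asks for fits inside the window of bridge 8. As a quick Python illustration using only the numbers quoted above (the dictionary layout and the `shortfall_mb` helper are made up for this sketch):

```python
# Window sizes in MB as quoted above: (bridge 8 window, bridge 9 window).
allocations = {
    "old bios": (544, 296),
    "new bios": (304, 384),
}


def shortfall_mb(parent_mb: int, child_mb: int) -> int:
    """MB missing in the parent window; 0 means the child window fits."""
    return max(0, child_mb - parent_mb)


for bios, (bridge8_mb, bridge9_mb) in allocations.items():
    missing = shortfall_mb(bridge8_mb, bridge9_mb)
    status = "fits" if missing == 0 else "short by {} MB".format(missing)
    print("{}: {}".format(bios, status))
```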
