Driver Install Successful but nvidia-smi Finds No Devices

Hi,

I have a GTX 1070 eGPU connected to my Ubuntu 20.04 computer via Thunderbolt 3.
I successfully authenticated the eGPU and installed the 460 drivers using the Ubuntu "Additional Drivers" dialogue, but nvidia-smi just says "No devices were found".
nvidia-settings displays the selection dialogue but crashes upon selection, printing:

ERROR: Unable to load info from any available system


(nvidia-settings:2978): GLib-GObject-CRITICAL **: 13:18:26.012: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
** Message: 13:18:26.016: PRIME: Requires offloading
** Message: 13:18:26.016: PRIME: is it supported? yes
** Message: 13:18:26.049: PRIME: Usage: /usr/bin/prime-select nvidia|intel|on-demand|query
** Message: 13:18:26.049: PRIME: on-demand mode: "1"
** Message: 13:18:26.049: PRIME: is "on-demand" mode supported? yes
Segmentation fault (core dumped)

Please also find the nvidia-bug-report.log.gz attached.

Thank you very much for any help getting the card running.

It’s a MacBook Pro, which is known to rarely work with an eGPU over Thunderbolt under Linux due to its proprietary, nonstandard Thunderbolt and UEFI implementation.
You might try the kernel parameter pci=realloc.
All other workarounds are here:
https://github.com/Dunedan/mbp-2016-linux/issues/60#issuecomment-397834729
If none work, you’re out of luck.

Hi, thanks for your reply…
I am not very hopeful, and you’re right, but I figured that since I don’t have a T-security-chip MacBook Pro but the rather well-behaved MacBookPro13,1, it was worth a try.
I also once got the eGPU running under Windows if that changes something…

Try pci=realloc. Windows does that by default, but OTOH Apple provides Thunderbolt drivers through Boot Camp.
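
Before testing anything else, it may be worth confirming that the running kernel was really booted with the parameter. The following is a minimal sketch, not an official tool: has_kernel_param is a made-up helper name, and it only does an exact-word match against /proc/cmdline.

```shell
#!/usr/bin/env bash
# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline (the running kernel).
# Limitation: exact match only, so a combined form such as
# "pci=realloc,nocrs" would NOT be detected by this sketch.
has_kernel_param() {
    local param=$1
    local cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    local words word
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

if has_kernel_param pci=realloc; then
    echo "pci=realloc is active"
else
    echo "pci=realloc is NOT active"
fi
```

On a stock Ubuntu install with GRUB, the parameter is usually made persistent by adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running sudo update-grub; other boot loaders need their own configuration.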

Using pci=realloc didn’t do anything except make nvidia-smi report that it couldn’t communicate with the NVIDIA driver. Anyway, thanks again. I’m not giving up yet, and I will update this thread if I manage to get it to work…

Then you can only use the method from the link:
Stop the X server
Unload the nvidia driver (sudo modprobe -r nvidia)
Make sure it is unloaded (lsmod | grep nvidia)
Get a root shell (sudo -s)
Remove/add back bridge 1c.4:

echo 1 > /sys/bus/pci/devices/0000:00:1c.4/remove
echo 1 > /sys/bus/pci/rescan

Load the nvidia driver (modprobe nvidia)
Check dmesg/nvidia-smi
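
The steps above can be sketched as a single script. This is only an illustration under a few assumptions: the bridge address 0000:00:1c.4 is the one from this thread and will differ on other machines, the DRY_RUN and SYSFS variables and the function names are made up for this sketch, and the module list assumes only nvidia_drm and nvidia_modeset sit on top of nvidia (others, such as nvidia_uvm, would have to be unloaded too if they are loaded). It defaults to a dry run that only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch of the remove/rescan procedure. Needs root and a stopped X server
# when run for real. With DRY_RUN=1 (the default) nothing is changed.
DRY_RUN=${DRY_RUN:-1}
SYSFS=${SYSFS:-/sys}              # overridable, e.g. for testing
BRIDGE=${BRIDGE:-0000:00:1c.4}    # bridge from this thread; machine specific

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# write_sysfs VALUE PATH: write VALUE to PATH, or print the intent.
write_sysfs() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would write $1 to $2"
    else
        echo "$1" > "$2"
    fi
}

reprobe_egpu() {
    # Dependants first: nvidia_drm uses nvidia_modeset, which uses nvidia.
    run modprobe -r nvidia_drm nvidia_modeset nvidia
    # Remove the bridge, then rescan so the kernel re-assigns its resources.
    write_sysfs 1 "$SYSFS/bus/pci/devices/$BRIDGE/remove"
    write_sysfs 1 "$SYSFS/bus/pci/rescan"
    run modprobe nvidia
}

reprobe_egpu
```

Saved to a file, it would be run for real as root with DRY_RUN=0 set in the environment, after checking that the dry-run output looks right for the machine.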

Hi again, and thank you!
I tried it, but upon executing the step echo 1 > /sys/bus/pci/rescan, the level 3 session just outputs:

[drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to allocate NvKmsKapiDevice
[drm:nv_drm_probe_devices [nvidia-drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to register device

I needed to unload nvidia_drm first because it was using nvidia_modeset, which in turn was using nvidia, in case this has anything to do with it.

Does the error clarify anything? Could this be a PCI bandwidth issue akin to error 12 on Windows?

Thanks again!

The problem has always been clear. It’s a PCI resource allocation failure:

[    7.421952] pci 0000:0a:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    7.421954] pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]

The workaround for that on Apple hardware is to set pci=realloc and remove/add back the bridge in the hope that the BAR memory reallocation works. If it doesn’t, better forget about it.
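
To check for this failure without reading the whole log, the kernel messages can be filtered for it. A small sketch follows: bar_failures is a made-up helper name, and the pattern only matches the wording shown in the two log lines above, so other kernel versions may phrase it differently. For scale, the requested size 0x10000000 is 256 MiB.

```shell
#!/usr/bin/env bash
# Print kernel log lines that report a failed PCI BAR assignment.
# Reads the log from stdin, so it works with dmesg or a saved log file.
bar_failures() {
    grep -E 'pci [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]: BAR [0-9]+: (no space for|failed to assign)'
}

# Typical use (dmesg may need root on some systems):
#   dmesg | bar_failures
```

No output after the remove/rescan would suggest that the reallocation went through this time; nvidia-smi remains the real test.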

Hi again,

It did work!! nvidia-smi shows the expected output. Thank you so much for your most valuable input.
Previously, I wasn’t booting with pci=realloc when I did the remove/rescan steps.

Restarting gdm, however, didn’t work initially, i.e. the internal screen could not be accelerated, but running prime-select intel did the trick. Now I am not sure which screen is being accelerated by which card, as I have an external monitor hooked up to the 1070, which is now also working.

Anyway, after rebooting, all is back to how it was. How could I make the change permanent? Writing a script that does this on each reboot?

Thank you very much again.

eGPUs have to be explicitly enabled in the X config before they can be used:
https://forums.developer.nvidia.com/t/internal-display-freezing-on-startup-with-egpu/170468/4

You will have to write a script that always does the remove/add action on startup, e.g. using a one-shot systemd unit running before display-manager and gpu-manager.
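
A rough sketch of such a unit, generated from a shell function so it can be inspected before installing. The script path /usr/local/sbin/egpu-reprobe.sh, the unit name and the print_unit helper are all hypothetical; only the ordering against display-manager and gpu-manager comes from the advice above.

```shell
#!/usr/bin/env bash
# Print a one-shot systemd unit that runs a re-probe script before the
# display manager and gpu-manager start. SCRIPT_PATH is a hypothetical
# location for a script that performs the remove/rescan steps.
SCRIPT_PATH=${SCRIPT_PATH:-/usr/local/sbin/egpu-reprobe.sh}

print_unit() {
    cat <<EOF
[Unit]
Description=Re-probe Thunderbolt eGPU (remove and rescan PCI bridge)
Before=display-manager.service gpu-manager.service

[Service]
Type=oneshot
ExecStart=$SCRIPT_PATH

[Install]
WantedBy=multi-user.target
EOF
}

print_unit
```

One way to install it would be print_unit | sudo tee /etc/systemd/system/egpu-reprobe.service, followed by sudo systemctl daemon-reload and sudo systemctl enable egpu-reprobe.service. Whether the ordering is enough on a given machine needs testing.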

Addendum: it’s often difficult to get the timing right since the nvidia driver takes some 2 seconds to initialise. So if X starts up before that, the nvidia gpu will not be used.
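
One possible way to deal with the timing is to have the startup script block until the driver answers, with a timeout. The sketch below uses a generic made-up helper, wait_for; the use of nvidia-smi -L as the readiness check assumes that it exits with a non-zero status until a GPU is usable, which should be verified on the machine in question.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND about once per second until it succeeds.
# Returns 0 on success, 1 if TIMEOUT_SECONDS pass without success.
wait_for() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use at the end of the startup script (assumption, see above):
#   wait_for 10 nvidia-smi -L || echo "nvidia GPU not ready" >&2
```

Because a one-shot unit is not considered started until its command exits, waiting inside the script should also delay anything ordered after the unit.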