Nvidia-smi "No devices were found"

Hi, I have an exotic configuration:
a Mac Pro 2013 with two internal AMD GPUs.
Because of the Thunderbolt 2 interfaces and an available GeForce GTX 1050 Ti, I thought I would expand the Mac with an eGPU enclosure.

The eGPU seems to work and is authorized, and the GPU is detected by Linux. But nvidia-smi only says "No devices were found".

I have tried a few things in the meantime:
different boot parameters, different xorg configs and different nvidia driver versions (470 and 495), on Ubuntu and Manjaro.

With 495 I got "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:19:00.0)".
Currently I am using driver version 470.86.

My cmdline:
BOOT_IMAGE=/boot/vmlinuz-5.15-x86_64 root=UUID=9cdd965a-1aae-4df6-9478-eac5e837fda0 rw quiet apparmor=1 security=apparmor udev.log_priority=3 radeon.si_support=0 amdgpu.si_support=1 pcie_ports=native pci=realloc iommu=on

My current distribution is Manjaro.

– lspci -k tells me the "nvidia" driver is in use:

19:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 3351
Kernel driver in use: nvidia

– dmesg shows this error:
[ 1777.840544] NVRM: GPU 0000:19:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 1777.840614] NVRM: GPU 0000:19:00.0: rm_init_adapter failed, device minor number 0

At the moment I am out of ideas and need help.

nvidia-bug-report.log (716.9 KB)

Please see this thread:
https://forums.developer.nvidia.com/t/driver-for-rtx3070-not-working-under-elementary-os-on-macbook-pro-with-egpu/164829?u=generix
With Macs, sometimes pci=realloc is enough, sometimes you’ll have to do the full monty.

Hi,

Thanks, I will try it, but one question.
In the last post of the linked thread, step number 6 says:

  6. get a root console, remove and add back the pci bridge
    sudo -s
    echo 1 > /sys/bus/pci/devices/0000:00:01.1/remove

My Mac does not have the PCI device 00:01.1 to remove.

What device is this? Is it one of the PCI bridges
00:01.0 - 00:03.0 (log file at row 2293)?

It’s the pci bridge the nvidia gpu is connected to according to lspci -t
Should be 0000:00:01.0 in your case.
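One way to double-check which bridge that is, assuming the GPU sits at 0000:19:00.0 as in the lspci output above:

# Show the PCI topology as a tree; follow the chain that ends at 19:00.0
# up to the root port it hangs off (0000:00:01.0 here); that is the
# bridge to remove and rescan.
lspci -tv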

Hi, thanks, this seems to be working. Great!! :)

My steps:

  1. Blacklist nvidia
  2. Update the initrd
  3. Disable the display manager, then reboot.
    The steps above were the same as in the other posting.
    At step 5 of that posting I am not sure whether the blacklist worked.
  4. First I needed SSH access, because after
    “echo 1 > /sys/bus/pci/devices/0000:00:01.0/remove” my keyboard and mouse were gone.
    Via SSH I had to enter the command twice; after the second time the rescan command worked.
  5. Was not necessary: nvidia-smi now found a device.
  6. Created a new nvidia-bug-report:
    nvidia-bug-report.log (1.6 MB)
  7. Started the display manager:
    $ systemctl start sddm.service

OK, it seems to be working now, but what are the next steps? How do I get this working without step 6?
The nvidia X server settings show the GPU as "On-Demand". What exactly is the difference between "On-Demand" and "Performance Mode"? How exactly do prime-select and/or prime-query work?
Any suggestions?

Step 6 was just for debugging so I could see errors in case of failure.
The next step would be creating a systemd unit and a script to have this run automatically on system boot, e.g.:

[Unit]
Description=Nvidia GPU initialization
Before=gpu-manager.service

[Service]
Type=oneshot
ExecStart=/usr/bin/egpu.sh
ExecStartPre=

[Install]
WantedBy=display-manager.target

Put the necessary commands into /usr/bin/egpu.sh, check if it works after boot, then re-enable sddm and reboot.
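For completeness, roughly how the unit could be installed and enabled; the file name egpu.service is just an example, adjust it to whatever you save the unit as:

# Assuming the unit above is saved as /etc/systemd/system/egpu.service
# and the script as /usr/bin/egpu.sh:
chmod +x /usr/bin/egpu.sh
systemctl daemon-reload
systemctl enable egpu.service
# Once it works, re-enable the display manager again:
systemctl enable sddm.service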
To enable the egpu for the Xserver, see:
https://forums.developer.nvidia.com/t/internal-display-freezing-on-startup-with-egpu/170468/4?u=generix

prime-select
“nvidia” aka “performance mode” means the nvidia gpu will always render everything.
“on-demand” means the nvidia gpu needs to be explicitly invoked to render an application, see:
https://download.nvidia.com/XFree86/Linux-x86_64/495.44/README/primerenderoffload.html

I created a service and an egpu script.
Both work, but not during boot.

Could it be that it should be "WantedBy=display-manager.service" instead of .target?

OK, to get it to start during boot I still have to experiment. In the meantime I start the service manually and then start sddm.service.

The /etc/X11/xorg.conf.d/10-nvidia-egpu.conf was also created and /etc/X11/xorg.conf was deleted, but my monitor stays black.

nvidia-smi always shows "Off", like in the other post:

|   0  GeForce RTX 3070    Off  |

How can I enable the GPU manually? prime-select is set to "nvidia", aka "performance mode".

I would mainly like to use the nvidia card instead of the AMD devices.

“Off” means persistence mode is Off, not the gpu. This is fine.
You might also want to try

Before=display-manager.service
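e.g. a sketch of the adjusted unit; apart from the Before= line it follows the earlier suggestion, and WantedBy=graphical.target is just one option here (the earlier suggestion used display-manager.target):

[Unit]
Description=Nvidia GPU initialization
Before=display-manager.service

[Service]
Type=oneshot
ExecStart=/usr/bin/egpu.sh

[Install]
WantedBy=graphical.target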

Hi,

after many tries with "display-manager.service", "graphical.target", Before=, WantedBy=, etc., I created a cron job which starts the script at boot. That seems to work. I don't know why, but the systemd service file didn't work.

Now my last problem: the monitor is connected to the eGPU, but there is no display.
/etc/X11/xorg.conf.d/10-nvidia-egpu.conf is created, but it has no effect.

How do I get the screen output through the 1050? How can I switch to the 1050?

Please create a new nvidia-bug-report.log

nvidia-bug-report.log (2.0 MB)
My current bug report.

This looks like the driver is already loaded when the script removes/re-adds the bus, so it gets removed. Furthermore, it's doing it too late; the Xserver has already started by the time the nvidia gpu comes alive:
X starts after 10.5 s
nvidia gpu ready after 13.2 s

My latest bug report. The .service file now works, and I removed the boot-time crontab entry.
nvidia-bug-report.log (2.0 MB)

Where do you see the time it was loaded? The .service file should load it before display-manager.service.

But the monitor is still black.

I also tried the command
$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep vendor
This worked.

server glx vendor string: SGI
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation

But with more complex graphics demos it fails.

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./valley

Loading "/home/michael/.Valley/valley_1.0.cfg"...
Loading "libGPUMonitor_x64.so"...
Loading "libGL.so.1"...
Loading "libopenal.so.1"...
Set 2560x1440 fullscreen video mode
X Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  152 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  0
  Current serial number in output stream:  59
AL lib: (EE) alc_cleanup: 1 device not closed

It nearly works; only a little is missing.

Just take a look at the timestamps in dmesg and xorg logs, then you see when things happen.
The timing seems to be correct now but the driver isn’t loaded after re-adding the gpu. Try adding a
modprobe nvidia
at the end of your script, maybe with a sleep 1 (or 2) before and after it.

OK, here is my simple script:

#!/bin/bash
# Remove the root port the eGPU hangs off and rescan the bus;
# on this machine the remove/rescan has to be done twice.
echo 1 > /sys/bus/pci/devices/0000:00:01.0/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/bus/pci/devices/0000:00:01.0/remove
echo 1 > /sys/bus/pci/rescan
sleep 1
# Load the nvidia driver once the device has been re-added.
modprobe nvidia
sleep 1

It seems to work.
nvidia-bug-report.log (2.0 MB)

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x5b cap: 0x9, Source Output, Sink Offload crtcs: 6 outputs: 6 associated providers: 1 name:AMD Radeon HD 7800 Series @ pci:0000:06:00.0
Provider 1: id: 0xab cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 6 associated providers: 1 name:AMD Radeon HD 7800 Series @ pci:0000:02:00.0

A question: the Mac Pro has two AMD GPUs and one nvidia GPU. Should this be noted in the xorg.conf file?
Is it possible that I need a third provider entry for the nvidia GPU in this list?

Another question: is it easier to use the nvidia GPU for rendering with prime-run, or is there a better way? This already ran under Manjaro, but it was very slow.

The driver doesn’t load, so the gpu is not used by Xorg.
Try adding
modprobe -r nvidia
at the beginning of the script.

Also check journalctl -e for why it doesn’t load.
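e.g. something like this to filter the current boot for driver messages:

journalctl -b | grep -iE 'nvrm|nvidia'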

Can you show me what this would look like when the kernel module is loaded?
Sorry, I don't know exactly which entry I should be looking for.

Here are the logs:
journal.txt (106.8 KB)
nvidia-bug-report.log (2.0 MB)

I also added this entry to the service:
After=bolt.service

I thought bolt.service should be completed before the script starts.

It seems that the Vulkan API works with the nvidia GPU:

__NV_PRIME_RENDER_OFFLOAD=1 vkcube    
WARNING: radv is not a conformant Vulkan implementation, testing use only.
WARNING: radv is not a conformant Vulkan implementation, testing use only.
nvidia-smi 
Sun Dec 26 21:11:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:19:00.0 Off |                  N/A |
|  0%   36C    P0    N/A /  90W |     10MiB /  4040MiB |     81%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5139    C+G   vkcube                              7MiB |
+-----------------------------------------------------------------------------+

It’s running on the plain drm device but as you can see, there’s no Xorg process on the gpu since it doesn’t have a dri dev node.
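A quick way to check which dri device nodes exist and which PCI devices they belong to (the nvidia gpu at 0000:19:00.0 should show up here once the driver has created its node):

ls -l /dev/dri/
ls -l /dev/dri/by-path/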
From dmesg: on the initial boot the nvidia gpu doesn't work:

[    0.549169] pci 0000:19:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    0.549171] pci 0000:19:00.0: BAR 1: trying firmware assignment [mem 0xc0000000-0xcfffffff 64bit pref]
[    0.549172] pci 0000:19:00.0: BAR 1: [mem 0xc0000000-0xcfffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x80000000-0xdfffffff window]
[    0.549174] pci 0000:19:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.549176] pci 0000:19:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.549178] pci 0000:19:00.0: BAR 3: trying firmware assignment [mem 0xd0000000-0xd1ffffff 64bit pref]
[    0.549179] pci 0000:19:00.0: BAR 3: [mem 0xd0000000-0xd1ffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x80000000-0xdfffffff window]
[    0.549181] pci 0000:19:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    0.549183] pci 0000:19:00.0: BAR 0: assigned [mem 0xa1000000-0xa1ffffff]
[    0.549190] pci 0000:19:00.1: BAR 0: assigned [mem 0xa2000000-0xa2003fff]
[    0.549197] pci 0000:19:00.0: BAR 5: no space for [io  size 0x0080]
[    0.549198] pci 0000:19:00.0: BAR 5: failed to assign [io  size 0x0080]

Then the nvidia driver loads on the defunct device and fails to create /dev/dri/card0:

[    2.357771] nvidia: loading out-of-tree module taints kernel.
[    2.357783] nvidia: module license 'NVIDIA' taints kernel.
[    2.567434] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.86  Tue Oct 26 21:46:51 UTC 2021
[    2.571815] [drm] [nvidia-drm] [GPU ID 0x00001900] Loading driver
[    2.576558] NVRM: GPU 0000:19:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[    2.576627] NVRM: GPU 0000:19:00.0: rm_init_adapter failed, device minor number 0
[    2.576766] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00001900] Failed to allocate NvKmsKapiDevice
[    2.576984] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00001900] Failed to register device

Then the device is removed and re-added, and the gpu is working:

[   10.890204] pci 0000:19:00.0: BAR 1: assigned [mem 0xb0000000-0xbfffffff 64bit pref]
[   10.890225] pci 0000:19:00.0: BAR 3: assigned [mem 0xa8000000-0xa9ffffff 64bit pref]
[   10.890245] pci 0000:19:00.0: BAR 0: assigned [mem 0xa1000000-0xa1ffffff]
[   10.890252] pci 0000:19:00.0: BAR 6: assigned [mem 0xa2000000-0xa207ffff pref]
[   10.890254] pci 0000:19:00.1: BAR 0: assigned [mem 0xa2080000-0xa2083fff]
[   10.890261] pci 0000:19:00.0: BAR 5: assigned [io  0x5000-0x507f]

But all of this happens while the nvidia driver is still loaded, so the missing dri dev node is not recreated.
You'll have to make sure the driver is unloaded and reloaded after the pci device is working so the dri node is correctly created for Xorg. Since this happens after amdgpu loads, it should be /dev/dri/card2 then.
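e.g. a sketch of how the script could look with that change; the exact companion module names (nvidia_drm, nvidia_modeset, nvidia_uvm) are the usual ones and may vary with the driver version:

#!/bin/bash
# Unload the nvidia modules that attached to the defunct device
# (harmless if they are not loaded yet).
modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia 2>/dev/null

# Remove the root port the eGPU hangs off and rescan the bus so the BARs
# get reassigned; on this machine the remove/rescan has to be done twice.
echo 1 > /sys/bus/pci/devices/0000:00:01.0/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/bus/pci/devices/0000:00:01.0/remove
echo 1 > /sys/bus/pci/rescan
sleep 1

# Reload the driver only now, so the dri dev node is created on the
# working device (nvidia_drm pulls in nvidia_modeset and nvidia).
modprobe nvidia_drm
sleep 1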