Nvidia-smi "No devices were found"

Just take a look at the timestamps in the dmesg and Xorg logs; then you can see when things happen.
The timing seems to be correct now, but the driver isn’t loaded after re-adding the GPU. Try adding a
modprobe nvidia
at the end of your script, maybe with a sleep 1 (or 2) before and after it.

Ok, here is my simple script:

#!/bin/bash
# remove the upstream PCI device and rescan (twice) so the kernel
# reassigns the GPU's BARs, then load the nvidia driver
echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/remove
echo 1 > /sys/bus/pci/rescan
sleep 1
modprobe nvidia
sleep 1

It seems to work.
nvidia-bug-report.log (2.0 MB)

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x5b cap: 0x9, Source Output, Sink Offload crtcs: 6 outputs: 6 associated providers: 1 name:AMD Radeon HD 7800 Series @ pci:0000:06:00.0
Provider 1: id: 0xab cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 6 associated providers: 1 name:AMD Radeon HD 7800 Series @ pci:0000:02:00.0

A question: the Mac Pro has two AMD GPUs and one NVIDIA. Should this be declared in the xorg.conf file?
Is it possible that I need a third provider?
“Provider 2: id: … nvidia…”

Another question: is it easier to use the NVIDIA GPU for rendering with prime-run, or is there a better way? This already ran under Manjaro, but it was very slow.

The driver doesn’t load, so the GPU is not used by Xorg.
Try adding
modprobe -r nvidia
at the beginning of the script.

Also check journalctl -e for why it doesn’t load.
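For example, to pull only the relevant messages (assuming a systemd new enough for --grep):

# kernel messages from the current boot that mention nvidia
journalctl -b -k --grep=nvidia
# or jump to the end of the full journal and scroll back
journalctl -e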

Can you show me what this would look like when the kernel module is loaded?
Sorry, I don’t know exactly what entry I should be looking for.

Here are the logs:
journal.txt (106.8 KB)
nvidia-bug-report.log (2.0 MB)

I also added the entry
After=bolt.service
to the service.

I thought bolt.service should be completed before the script starts.
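For reference, a minimal sketch of such a oneshot unit; everything except After=bolt.service is hypothetical (unit name, description, and script path are placeholders):

[Unit]
Description=Re-add the eGPU after Thunderbolt authorization
After=bolt.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/egpu-rescan.sh

[Install]
WantedBy=multi-user.target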

It seems that the Vulkan API works with the NVIDIA GPU:

__NV_PRIME_RENDER_OFFLOAD=1 vkcube    
WARNING: radv is not a conformant Vulkan implementation, testing use only.
WARNING: radv is not a conformant Vulkan implementation, testing use only.
nvidia-smi 
Sun Dec 26 21:11:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:19:00.0 Off |                  N/A |
|  0%   36C    P0    N/A /  90W |     10MiB /  4040MiB |     81%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5139    C+G   vkcube                              7MiB |
+-----------------------------------------------------------------------------+
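Side note: the radv warnings above come from the AMD Vulkan driver also being enumerated; NVIDIA's PRIME render offload documentation additionally sets a layer variable to pin Vulkan to the NVIDIA implementation:

# restrict the NVIDIA optimus layer to the NVIDIA ICD only
__NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only vkcube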

It’s running on the plain DRM device, but as you can see, there’s no Xorg process on the GPU since it doesn’t have a DRI device node.
From dmesg:
Initial boot, the nvidia GPU doesn’t work:

[    0.549169] pci 0000:19:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    0.549171] pci 0000:19:00.0: BAR 1: trying firmware assignment [mem 0xc0000000-0xcfffffff 64bit pref]
[    0.549172] pci 0000:19:00.0: BAR 1: [mem 0xc0000000-0xcfffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x80000000-0xdfffffff window]
[    0.549174] pci 0000:19:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.549176] pci 0000:19:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.549178] pci 0000:19:00.0: BAR 3: trying firmware assignment [mem 0xd0000000-0xd1ffffff 64bit pref]
[    0.549179] pci 0000:19:00.0: BAR 3: [mem 0xd0000000-0xd1ffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x80000000-0xdfffffff window]
[    0.549181] pci 0000:19:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    0.549183] pci 0000:19:00.0: BAR 0: assigned [mem 0xa1000000-0xa1ffffff]
[    0.549190] pci 0000:19:00.1: BAR 0: assigned [mem 0xa2000000-0xa2003fff]
[    0.549197] pci 0000:19:00.0: BAR 5: no space for [io  size 0x0080]
[    0.549198] pci 0000:19:00.0: BAR 5: failed to assign [io  size 0x0080]

Then the nvidia driver loads on the defunct device and fails to create /dev/dri/card0:

[    2.357771] nvidia: loading out-of-tree module taints kernel.
[    2.357783] nvidia: module license 'NVIDIA' taints kernel.
[    2.567434] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.86  Tue Oct 26 21:46:51 UTC 2021
[    2.571815] [drm] [nvidia-drm] [GPU ID 0x00001900] Loading driver
[    2.576558] NVRM: GPU 0000:19:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[    2.576627] NVRM: GPU 0000:19:00.0: rm_init_adapter failed, device minor number 0
[    2.576766] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00001900] Failed to allocate NvKmsKapiDevice
[    2.576984] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00001900] Failed to register device

Then the device is removed, re-added, and the GPU is working:

[   10.890204] pci 0000:19:00.0: BAR 1: assigned [mem 0xb0000000-0xbfffffff 64bit pref]
[   10.890225] pci 0000:19:00.0: BAR 3: assigned [mem 0xa8000000-0xa9ffffff 64bit pref]
[   10.890245] pci 0000:19:00.0: BAR 0: assigned [mem 0xa1000000-0xa1ffffff]
[   10.890252] pci 0000:19:00.0: BAR 6: assigned [mem 0xa2000000-0xa207ffff pref]
[   10.890254] pci 0000:19:00.1: BAR 0: assigned [mem 0xa2080000-0xa2083fff]
[   10.890261] pci 0000:19:00.0: BAR 5: assigned [io  0x5000-0x507f]

but all of this happens while the nvidia driver is still loaded, so the missing DRI device node is not recreated.
You’ll have to make sure the driver is unloaded and reloaded after the PCI device is working, so the DRI node is correctly created for Xorg. Since this happens after amdgpu loads, it should be /dev/dri/card2 then.
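A sketch of that ordering, reusing the device address from the script above (the module list is the standard nvidia stack; unloading can still fail if something holds the driver):

#!/bin/bash
# unload the stack that bound to the defunct device (ignore not-loaded modules)
modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia 2>/dev/null
# remove the device and rescan so the kernel reassigns the BARs
echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/remove
echo 1 > /sys/bus/pci/rescan
sleep 1
# nvidia_drm pulls in nvidia_modeset and nvidia; this recreates the DRI node
modprobe nvidia_drm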

Ok, I understand, thx.
Is it a timing problem?

The entry /dev/dri/card2 does not exist.

modprobe -r nvidia says
modprobe: FATAL: Module nvidia is in use.

I think the same problem occurs when my script is called.

First I remove it with
modprobe -r nvidia
sleep 1
then remove the PCI device and rescan…
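To see what is holding the module (paths assume the usual /dev/nvidia* device nodes):

# the "Used by" column shows dependent modules and the use count
lsmod | grep ^nvidia
# any process with a device node open keeps the module busy
sudo lsof /dev/nvidia* 2>/dev/null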

Please check if nvidia-persistenced is enabled with systemd and disable it.

Yes, it is enabled.
But disable or mask won’t work.

After boot it is loaded anyway. What is this?

systemctl status nvidia-persistenced.service 
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: active (running) since Sun 2021-12-26 22:25:22 CET; 30s ago
    Process: 935 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 938 (nvidia-persiste)
      Tasks: 1 (limit: 38356)
     Memory: 724.0K
        CPU: 3ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─938 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Dec 26 22:25:22 michael-MacPro systemd[1]: Starting NVIDIA Persistence Daemon...
Dec 26 22:25:22 michael-MacPro nvidia-persistenced[938]: Verbose syslog connection opened
Dec 26 22:25:22 michael-MacPro nvidia-persistenced[938]: Now running with user ID 124 and group ID 134
Dec 26 22:25:22 michael-MacPro nvidia-persistenced[938]: Started (938)
Dec 26 22:25:22 michael-MacPro nvidia-persistenced[938]: device 0000:19:00.0 - registered
Dec 26 22:25:22 michael-MacPro nvidia-persistenced[938]: Local RPC services initialized
Dec 26 22:25:22 michael-MacPro systemd[1]: Started NVIDIA Persistence Daemon.

What can I do with the command
$ nvidia-persistenced

It’s needed for headless, compute-only servers to keep the driver loaded and initialized.
It should be no problem to disable it; IIRC, Ubuntu uses a udev rule in /lib/udev/rules.d to start it. Try removing that and run sudo update-initramfs -u to also remove it from the initrd.
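For example (the exact rule file name varies by driver package, so search for it first):

# find the udev rule that launches the persistence daemon
grep -l nvidia-persistenced /lib/udev/rules.d/*.rules
# after deleting or commenting out the rule, rebuild the initrd
sudo update-initramfs -u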

Or just add
systemctl stop nvidia-persistenced
at the start of your script.
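i.e. the top of the script would gain one line (a sketch):

#!/bin/bash
# release the module before trying to unload it
systemctl stop nvidia-persistenced
modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia 2>/dev/null
# ...then remove/rescan and reload as before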

Ok, the service is deactivated.

I removed it from rules.d.

systemctl status nvidia-persistenced.service
○ nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: inactive (dead)

How does this help?

Like I said, it keeps the driver loaded and thus locked. You should now be able to unload it.

No, the message is still the same.

sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
modprobe: FATAL: Module nvidia is in use.

Since you disabled nvidia-persistenced and rebooted, maybe your script is working now and Xorg is blocking module unloading, which is fine?
Please check nvidia-smi for the Xorg process.

No, I checked nvidia-smi, but there is no process.

Now I switched back to Manjaro Linux.
I also need my script there to remove and rescan the PCI bus. I don’t know why, but the service won’t start at boot.
So I need to start it manually or with crontab.

But… I can use the GPU with prime-run. It is not as good as using the GPU directly, but it is better than nothing, and it’s much faster than the AMD GPU. For the moment I will use this.
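For reference, prime-run (from Manjaro's nvidia-prime package) is just a tiny wrapper, roughly:

#!/bin/bash
# force GLX and Vulkan onto the NVIDIA GPU for the wrapped command
__NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"

so prime-run glxinfo | grep "OpenGL renderer" shows whether offload actually hits the NVIDIA card.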

Perhaps I can get the service running at boot, and under Manjaro the timing is better for reloading the nvidia driver?

Under Manjaro I have the third card.
Is it possible to switch manually to the NVIDIA video output?
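If the NVIDIA card shows up as a provider, xrandr's provider commands can wire the outputs together; a sketch (names/indices must match your own --listproviders output):

# make one provider's outputs drivable by another (reverse PRIME)
xrandr --setprovideroutputsource <sink-provider> <source-provider>
# e.g. slave the modesetting (AMD) provider to the NVIDIA one
xrandr --setprovideroutputsource modesetting NVIDIA-0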

ls -la /dev/dri
total 0
drwxr-xr-x   3 root root        180 Dec 27 11:22 .
drwxr-xr-x  22 root root       4280 Dec 27 12:16 ..
drwxr-xr-x   2 root root        160 Dec 27 11:22 by-path
crw-rw----+  1 root video  226,   0 Dec 27 11:22 card0
crw-rw----+  1 root video  226,   1 Dec 27 11:22 card1
crw-rw----+  1 root video  226,   2 Dec 27 12:59 card2
crw-rw-rw-   1 root render 226, 128 Dec 27 11:22 renderD128
crw-rw-rw-   1 root render 226, 129 Dec 27 11:22 renderD129
crw-rw-rw-   1 root render 226, 130 Dec 27 11:22 renderD130
ls -la /dev/dri/by-path
total 0
drwxr-xr-x 2 root root 160 Dec 27 11:22 .
drwxr-xr-x 3 root root 180 Dec 27 11:22 ..
lrwxrwxrwx 1 root root   8 Dec 27 11:22 pci-0000:02:00.0-card -> ../card1
lrwxrwxrwx 1 root root  13 Dec 27 11:22 pci-0000:02:00.0-render -> ../renderD129
lrwxrwxrwx 1 root root   8 Dec 27 12:59 pci-0000:06:00.0-card -> ../card2
lrwxrwxrwx 1 root root  13 Dec 27 11:22 pci-0000:06:00.0-render -> ../renderD130
lrwxrwxrwx 1 root root   8 Dec 27 11:22 pci-0000:19:00.0-card -> ../card0
lrwxrwxrwx 1 root root  13 Dec 27 11:22 pci-0000:19:00.0-render -> ../renderD128

How does this look?

The script now works at boot, and /dev/dri/card0 through card2 are available.

journal.txt (129.1 KB)

nvidia-bug-report.log (1.0 MB)

I have a screen from the eGPU now.

Providers: number : 3
Provider 0: id: 0x1b7 cap: 0x1, Source Output crtcs: 4 outputs: 4 associated providers: 2 name:NVIDIA-0
Provider 1: id: 0x243 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 6 outputs: 6 associated providers: 1 name:modesetting
Provider 2: id: 0x208 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 6 outputs: 6 associated providers: 1 name:modesetting
nvidia-smi
Mon Dec 27 18:46:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:19:00.0  On |                  N/A |
|  0%   55C    P0    N/A /  90W |    102MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1505      G   /usr/lib/Xorg                      99MiB |
|    0   N/A  N/A      2139      G   /usr/bin/nvidia-settings            0MiB |
+-----------------------------------------------------------------------------+

I used this xorg.conf:

Section "Module"
    Load "modesetting"
EndSection

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    BusID      "PCI:25:0:0"
    Option     "AllowEmptyInitialConfiguration"
    Option     "AllowExternalGpus" "True"
EndSection

nvidia-settings also shows me the card info and displays.
But it is very slow. I think it doesn’t use the GPU for 3D, only for display output. What can I do?
nvidia-bug-report.log (1.0 MB)
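A quick check for that is the reported GLX renderer (glxinfo comes from the mesa-utils/mesa-demos package):

# if this prints an AMD or llvmpipe renderer, 3D is not running on the NVIDIA GPU
glxinfo | grep "OpenGL renderer"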