I am running Ubuntu 22.04.3 LTS on a machine with two RTX 4090 GPUs. lspci sees both:
# lspci -k | grep -EA2 'VGA|3D'
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device 3675
Kernel driver in use: nvidia
--
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff)
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff)
However, only one of them appears in nvidia-smi:
# nvidia-smi
Wed Dec 13 16:50:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | 0 |
| 0% 29C P8 11W / 450W | 3MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
dmesg shows this error after running nvidia-smi:
[282088.884537] nvidia-nvlink: Nvlink Core is being initialized, major device number 504
[282088.885660] nvidia 0000:21:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[282088.885760] nvidia 0000:21:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[282088.885793] NVRM: The NVIDIA GPU 0000:21:00.0
NVRM: (PCI ID: 10de:2684) installed in this system has
NVRM: fallen off the bus and is not responding to commands.
[282088.885827] nvidia: probe of 0000:21:00.0 failed with error -1
[282088.885882] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[282088.931434] NVRM: The NVIDIA probe routine failed for 1 device(s).
[282088.931438] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.23.08 Mon Nov 6 23:49:37 UTC 2023
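The `(rev ff)` on 21:00.0 and 21:00.1 in the lspci output seems consistent with the "config space inaccessible" message: lspci typically prints `rev ff` when reads from a device's config space return all ones. A minimal sketch for spotting such devices, assuming the default lspci line format shown above; the function name `find_rev_ff` is hypothetical:

```shell
# Hypothetical helper: read `lspci` output on stdin and print the slot
# address of every device whose revision is reported as ff.
find_rev_ff() {
    # $1 is the slot (e.g. 21:00.0); match only lines ending in "(rev ff)".
    awk '/\(rev ff\)$/ { print $1 }'
}

# Usage (not executed here): lspci | find_rev_ff
```

On the output above this would list 21:00.0 and 21:00.1, but not 01:00.0.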
How can I fix this?
How do I get the second GPU back on the bus?
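For reference, one software-only thing that can be tried without a reboot is a PCI remove and rescan through sysfs. This is only a sketch under assumptions: the function name `pci_remove_rescan` and the `SYSFS_PCI` override are hypothetical, it must run as root, and it may well not recover a card stuck in D3cold if the underlying cause is power, cabling, or the slot itself.

```shell
# Hypothetical helper: detach one PCI function, then ask the kernel to
# re-enumerate the bus. SYSFS_PCI is an assumed override for the sysfs path.
pci_remove_rescan() {
    dev="$1"                                  # e.g. 0000:21:00.0
    root="${SYSFS_PCI:-/sys/bus/pci}"
    if [ -e "$root/devices/$dev/remove" ]; then
        echo 1 > "$root/devices/$dev/remove"  # drop the stale device entry
    fi
    sleep 1                                   # give the kernel a moment
    echo 1 > "$root/rescan"                   # re-scan all PCI buses
}

# Usage (as root, not executed here): pci_remove_rescan 0000:21:00.0
```

After the rescan, checking lspci and dmesg again shows whether the device came back with a real revision instead of `ff`.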