Nvidia RTX 4090 card missing from nvidia-smi

I am running Ubuntu 22.04.3 LTS. There are TWO 4090 GPUs:

# lspci -k | grep -EA2 'VGA|3D'  
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device 3675
	Kernel driver in use: nvidia
--
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff)
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff)

However, only one appears:

# nvidia-smi
Wed Dec 13 16:50:12 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                    0 |
|  0%   29C    P8              11W / 450W |      3MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

dmesg shows this error after running nvidia-smi:

[282088.884537] nvidia-nvlink: Nvlink Core is being initialized, major device number 504

[282088.885660] nvidia 0000:21:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[282088.885760] nvidia 0000:21:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[282088.885793] NVRM: The NVIDIA GPU 0000:21:00.0
                NVRM: (PCI ID: 10de:2684) installed in this system has
                NVRM: fallen off the bus and is not responding to commands.
[282088.885827] nvidia: probe of 0000:21:00.0 failed with error -1
[282088.885882] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[282088.931434] NVRM: The NVIDIA probe routine failed for 1 device(s).
[282088.931438] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  545.23.08  Mon Nov  6 23:49:37 UTC 2023

How can I fix this?
How do I get the card back on the bus?

Hi there @user79475 and welcome back to the forums.

Several things to check first:

  • Power supply
  • Resizable BAR settings: the motherboard BIOS must support “Above 4G Decoding” so that the BARs can address the VRAM on both GPUs.
  • Check which PCIe configurations (2 x 16x, or at least 2 x 8x) your motherboard supports
  • Check temperatures
  • Test with only the second GPU installed to see whether it has a hardware failure.
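
If it helps while working through the list, here is a rough Python sketch for two of the checks. It is only an illustration: the function names `unresponsive_devices` and `read_link_info` are made up for this post, and it assumes a reasonably recent Linux kernel that exposes the `current_link_width` / `max_link_width` / `current_link_speed` / `max_link_speed` files under `/sys/bus/pci/devices/<address>/`. The first function looks for `(rev ff)` in `lspci` output, which usually means the device's config space reads back as all ones, i.e. the card is not answering (that would match the "config space inaccessible" line in your dmesg). The second reads the negotiated and maximum PCIe link settings for one device.

```python
import re
from pathlib import Path

# Matches lspci device lines that end in a revision, for example:
#   21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff)
# An optional PCI domain prefix (as printed by `lspci -D`) is accepted too.
_DEVICE_LINE = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) .*\(rev ([0-9a-f]{2})\)$"
)


def unresponsive_devices(lspci_text):
    """Return the PCI addresses whose revision is reported as ff."""
    addresses = []
    for line in lspci_text.splitlines():
        match = _DEVICE_LINE.match(line.rstrip())
        if match and match.group(2) == "ff":
            addresses.append(match.group(1))
    return addresses


def read_link_info(address, sysfs_root="/sys/bus/pci/devices"):
    """Read PCIe link attributes for one device from sysfs.

    `address` needs the domain prefix, e.g. "0000:01:00.0".
    Attributes that cannot be read are returned as None.
    """
    device_dir = Path(sysfs_root) / address
    info = {}
    for name in (
        "current_link_width",
        "max_link_width",
        "current_link_speed",
        "max_link_speed",
    ):
        try:
            info[name] = (device_dir / name).read_text().strip()
        except OSError:
            info[name] = None
    return info
```

You could paste the output of `lspci -k` into `unresponsive_devices(...)` and call `read_link_info("0000:01:00.0")` for the working card, then compare against `read_link_info("0000:21:00.0")` once the second card shows up again. A card that has dropped off the bus may return unreadable or meaningless values here, so treat the result as a hint rather than a diagnosis.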