Nvidia-smi is reporting the wrong number of GPUs

We are using 8x NVIDIA A100-SXM4-40GB GPUs on Ubuntu 22.04.
This setup has been working fine for months.
All of a sudden, both nvidia-smi and deviceQuery are reporting the wrong number of GPUs:

/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery# make run | grep 'CUDA Capable device(s)'
Detected 7 CUDA Capable device(s)

~# nvidia-smi --query-gpu=name,driver_version --format=csv
name, driver_version
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01

Both tools report 7 GPUs. However, 8 are physically installed in the machine and recognized by the BIOS.

Furthermore, the OS does detect the 8 GPUs:

~# lshw | grep -i -B 0 -A 5 '3D controller'
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:07:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:0b:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:48:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:4c:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:88:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:8b:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:c8:00.0
                            version: a1
--
                            description: 3D controller
                            product: NVIDIA Corporation
                            vendor: NVIDIA Corporation
                            physical id: 0
                            bus info: pci@0000:cb:00.0
                            version: a1
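
To find out which of the eight devices the driver is missing, one option is to compare the PCI addresses the kernel enumerates with the ones nvidia-smi reports. The sketch below assumes lspci is installed and that nvidia-smi prints bus IDs with an 8-digit domain (for example 00000000:07:00.0). The helper names normalize_bus_id and missing_from_driver are made up for illustration.

```shell
#!/bin/sh
# Sketch: list NVIDIA 3D controllers that the kernel sees but nvidia-smi does not.

# Hypothetical helper: lowercase the hex digits and shorten an 8-digit PCI
# domain (nvidia-smi style) to the 4-digit form used by lspci and lshw.
normalize_bus_id() {
    tr 'A-F' 'a-f' | sed -E 's/^[0-9a-f]{4}([0-9a-f]{4}:)/\1/'
}

# Hypothetical helper: given two sorted files of addresses, print the
# addresses that appear only in the first file.
missing_from_driver() {
    comm -23 "$1" "$2"
}

# Only touch the hardware tools if both are present on this machine.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    kernel_list=$(mktemp)
    driver_list=$(mktemp)
    # 10de is NVIDIA's PCI vendor ID; -D makes lspci print the PCI domain.
    # Keeping only "3D controller" lines mirrors the lshw filter used above.
    lspci -D -d 10de: | awk '/3D controller/ {print $1}' \
        | normalize_bus_id | sort > "$kernel_list"
    nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader \
        | normalize_bus_id | sort > "$driver_list"
    missing_from_driver "$kernel_list" "$driver_list"
    rm -f "$kernel_list" "$driver_list"
fi
```

Any address this prints is a GPU that is visible on the PCI bus but was not initialized by the driver, which narrows down where to look in the kernel log.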

Of course, I tried to reboot the machine. It did not help.

Does anyone have any other ideas?

Please run sudo nvidia-bug-report.sh and attach the log file. Typically, problems with these symptoms happen when the system BIOS doesn’t have enough address space to assign the BARs of all eight GPUs. If that’s what’s happening here, then something must have changed from the working configuration. Did any new devices get added to the system?
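
As a quick first check of the BAR hypothesis, the kernel log can be filtered for PCI resource-assignment failures and NVRM driver messages. This is only a rough sketch: the patterns are assumptions about how such messages are usually worded, the function name filter_gpu_init_errors is hypothetical, and an empty result does not rule the problem out.

```shell
#!/bin/sh
# Sketch: pull likely GPU initialization errors out of kernel log text.

# Hypothetical helper: reads log lines on stdin and keeps NVRM driver
# messages plus PCI lines that mention a BAR that could not be assigned.
# The patterns below are guesses at typical wording, not an exhaustive list.
filter_gpu_init_errors() {
    grep -iE 'NVRM|BAR [0-9]+.*(no space|failed to assign)'
}

# dmesg may require root; suppress its errors and tolerate "no matches".
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | filter_gpu_init_errors || true
fi
```

If lines show up for one PCI address only, comparing that address against the GPU missing from nvidia-smi is a reasonable next step.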

Please find below the requested logs:
nvidia-bug-report.log.gz (11.5 MB)

The configuration of the machine has not been changed lately. It is part of an HPC cluster (~20 identical machines).

Thanks in advance.