We are using 8 x NVIDIA A100-SXM4-40GB GPUs on Ubuntu 22.04.
This set-up has been working fine for months now.
All of a sudden, both nvidia-smi and deviceQuery are reporting the wrong number of GPUs:
/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery# make run | grep 'CUDA Capable device(s)'
Detected 7 CUDA Capable device(s)
~# nvidia-smi --query-gpu=name,driver_version --format=csv
name, driver_version
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
NVIDIA A100-SXM4-40GB, 515.86.01
Both tools report 7 GPUs. However, 8 are physically installed in the machine and recognized by the BIOS.
Furthermore, the OS does detect all 8 GPUs on the PCI bus:
~# lshw | grep -i -B 0 -A 5 '3D controller'
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:07:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:0b:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:48:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:4c:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:88:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:8b:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:c8:00.0
version: a1
--
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:cb:00.0
version: a1
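To narrow down which of the eight devices the driver is not exposing, the PCI bus IDs seen by the OS can be compared with the ones nvidia-smi reports. The following is only a minimal sketch: the helper names normalize_bus_id and missing_bus_ids are made up for illustration, and it assumes that lspci is installed and that all GPUs sit in PCI domain 0000 (as in the lshw output above), since the domain prefix is dropped before comparing.

```shell
#!/bin/sh
# Sketch: print PCI bus IDs that the OS sees but the NVIDIA driver does not expose.
# Helper names are hypothetical; assumes every GPU is in PCI domain 0000.

# Reduce each bus ID on stdin to lowercase "bb:dd.f" form,
# e.g. "00000000:0B:00.0" or "pci@0000:0b:00.0" becomes "0b:00.0".
normalize_bus_id() {
    tr '[:upper:]' '[:lower:]' |
        sed -E 's/^.*([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])[[:space:]]*$/\1/'
}

# Print the bus IDs that appear in the first list but not in the second.
# $1: newline-separated bus IDs seen by the OS
# $2: newline-separated bus IDs reported by the driver
missing_bus_ids() {
    os_list=$(printf '%s\n' "$1" | normalize_bus_id | sort -u)
    drv_list=$(printf '%s\n' "$2" | normalize_bus_id | sort -u)
    # -F fixed strings, -x whole-line match, -v keep only non-matching lines
    printf '%s\n' "$os_list" | grep -vxF -e "$drv_list" || true
}

# Run the comparison only if both tools are available on this machine.
if command -v nvidia-smi >/dev/null 2>&1 && command -v lspci >/dev/null 2>&1; then
    # 10de is NVIDIA's PCI vendor ID; keep only the "3D controller" entries.
    os_ids=$(lspci -D -d 10de: | grep -i '3D controller' | awk '{print $1}')
    drv_ids=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)
    missing_bus_ids "$os_ids" "$drv_ids"
fi
```

Any bus ID this prints is a GPU that is visible on the PCI bus but missing from the driver's list, which would give a concrete address to search for in the kernel log.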
I have already tried rebooting the machine, but it did not help.
Does anyone have any other ideas?