nvidia-smi cannot recognize A100s that show up in lspci; CUDA 12.5, driver version 555

I have two boxes: one (“infra-eno”) hosted in GCP with a single A100 attached, and one (“jaguar”) a physical server in my own data center with two A100s attached. Both could previously talk to their GPUs via nvidia-smi. After I upgraded to CUDA 12.5, the cloud box (“infra-eno”) kept working just fine, but the physical server (“jaguar”) lost the ability to see its GPUs via nvidia-smi.

robert@jaguar:~$ nvidia-smi -L
No devices found.
robert@jaguar:~$ lspci | grep -i nvidia
33:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
34:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
robert@jaguar:~$ nvidia-smi --version
NVIDIA-SMI version  : 555.42.02
NVML version        : 555.42
DRIVER version      : 555.42.02
CUDA Version        : 12.5

I tried downgrading to the Nvidia 545 driver, but that version has a GPL-symbol issue that keeps its kernel module from compiling, so I never got a new initramfs out of it. I then tried Nvidia 535, which built me a new initramfs but still gave no device connectivity. I also tried Nvidia 550, but nvidia-smi still doesn’t find any devices.
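
For anyone retracing this, these are the sanity checks worth running after each driver swap (a sketch assuming a Debian/Ubuntu-style install using DKMS; adjust for your distro):

robert@jaguar:~$ dkms status                          # did the module actually build for the running kernel?
robert@jaguar:~$ lsmod | grep nvidia                  # is the nvidia kernel module loaded?
robert@jaguar:~$ sudo dmesg | grep -iE 'nvrm|nvidia'  # any driver init errors?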

Note that the CUDA 12.5/Nvidia 555 configuration works just fine for infra-eno:

robert@infra-eno:~$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-0c5419ce-1616-144d-a4f8-71333d2294ad)
robert@infra-eno:~$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
robert@infra-eno:~$ nvidia-smi --version
NVIDIA-SMI version  : 555.42.02
NVML version        : 555.42
DRIVER version      : 555.42.02
CUDA Version        : 12.5

Here’s the bug report from Jaguar:
nvidia-bug-report.log.gz (1.4 MB)

Any idea how to get Jaguar to see the A100s again?

robert@jaguar:~$ cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0

Just tried adding that in, but after a sudo update-initramfs -u and a reboot, nvidia-smi still isn’t finding the devices.
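
To double-check that the blacklist took effect and that the nvidia module will load at all, something like this (a sketch; assumes the stock proprietary driver layout):

robert@jaguar:~$ lsmod | grep -i nouveau    # should print nothing if the blacklist worked
robert@jaguar:~$ sudo modprobe nvidia       # try loading the module by hand
robert@jaguar:~$ ls -l /dev/nvidia*         # device nodes should appear once the module is up (running nvidia-smi once can create them)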

Here’s the updated bug report, this time also generated with --extra-system-data:
nvidia-bug-report.log.gz (976.9 KB)

Both GPUs fail with Xid 62, which NVIDIA’s Xid listing describes as an internal micro-controller halt. I doubt both cards got damaged at the same time, so I suspect something else is going on. Would using the nvidia-open driver yield better logs?
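
The Xid lines themselves come out of the kernel log; to pull them, something along these lines:

robert@jaguar:~$ sudo dmesg -T | grep -i xid
robert@jaguar:~$ sudo journalctl -k | grep -i xid   # same messages via the journal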

Although shutdown -r now failed to fix the issue, an actual hardware power cycle of the server worked. I’m a bit hesitant to go in and change to the nvidia-open drivers at this point.
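
If it regresses and another walk to the rack is inconvenient, one softer thing that may be worth trying first is removing and rescanning the cards on the PCI bus (a sketch, not a guaranteed fix; Xid 62 may well still need a full power cycle. The 33:00.0/34:00.0 addresses come from the lspci output above, and the nvidia module should be unloaded first):

robert@jaguar:~$ sudo modprobe -r nvidia    # may need nvidia_uvm/nvidia_drm removed first
robert@jaguar:~$ sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000:33:00.0/remove'
robert@jaguar:~$ sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000:34:00.0/remove'
robert@jaguar:~$ sudo sh -c 'echo 1 > /sys/bus/pci/rescan'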