I have two boxes: one (“infra-eno”) is hosted in GCP with a single A100 attached, and the other (“jaguar”) is a physical server in my own data center with two A100s attached. Both could previously see their GPUs via nvidia-smi. After I upgraded to CUDA 12.5, the cloud box (“infra-eno”) kept working just fine, but the physical server (“jaguar”) lost the ability to see its GPUs via nvidia-smi.
robert@jaguar:~$ nvidia-smi -L
No devices found.
robert@jaguar:~$ lspci | grep -i nvidia
33:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
34:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
robert@jaguar:~$ nvidia-smi --version
NVIDIA-SMI version : 555.42.02
NVML version : 555.42
DRIVER version : 555.42.02
CUDA Version : 12.5
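In case it helps, here is the extra triage I can run on jaguar. This is just a sketch assuming a standard Linux install; the exact messages will differ per system, and the fallback strings are my own wording:

```shell
#!/bin/sh
# Triage for "No devices found": is the kernel module loaded, do the
# device nodes exist, and what did the driver log at init time?

# 1. Kernel module (read from /proc/modules so this works without lsmod):
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
  echo "nvidia kernel module: loaded"
else
  echo "nvidia kernel module: NOT loaded"
fi

# 2. Device nodes nvidia-smi talks to:
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes"

# 3. Driver init messages (NVRM lines usually show why probing failed):
dmesg 2>/dev/null | grep -i nvrm | tail -n 20
```

If the module loads but the NVRM lines show a probe failure (e.g. BAR mapping or firmware errors), that would point at the hardware/firmware side rather than the userspace driver stack.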
I tried downgrading to NVIDIA 545, but that driver has a GPL-symbol issue, so its kernel module won't compile and the boot image can't be rebuilt. So I tried downgrading to NVIDIA 535, which did produce a new initrd, but still no device connectivity. I also tried NVIDIA 550, but nvidia-smi still doesn't find any devices.
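To rule out a half-built module after all this driver churn, this is roughly how I'd check what DKMS actually built for the running kernel (a sketch; assumes the driver was installed via DKMS, which may not match how it was packaged on this box):

```shell
#!/bin/sh
# Which driver versions does DKMS think are built/installed?
dkms status 2>/dev/null || echo "dkms not available"

# Is there an nvidia module on disk for the *running* kernel?
kver="$(uname -r)"
echo "running kernel: $kver"
find "/lib/modules/$kver" -name 'nvidia*.ko*' 2>/dev/null
```

A mismatch here (module built for an older kernel, or no nvidia*.ko for the running kernel at all) would explain "No devices found" even though lspci sees the cards.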
Note that the CUDA 12.5/NVIDIA 555 configuration works just fine on infra-eno:
robert@infra-eno:~$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-0c5419ce-1616-144d-a4f8-71333d2294ad)
robert@infra-eno:~$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
robert@infra-eno:~$ nvidia-smi --version
NVIDIA-SMI version : 555.42.02
NVML version : 555.42
DRIVER version : 555.42.02
CUDA Version : 12.5
Here’s the bug report from Jaguar:
nvidia-bug-report.log.gz (1.4 MB)
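For reference, the attached log was generated with the collection script that ships with the driver:

```shell
# Writes nvidia-bug-report.log.gz into the current directory
# (bundles dmesg, lspci, module state, and nvidia-smi output):
sudo nvidia-bug-report.sh
```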
Any idea how to get Jaguar to see the A100s again?