NVRM: GPU 0000:64:00.0: RmInitAdapter failed!

LIQID stack with 24x A100 GPUs
8x hosts are connected to LIQID via PCIe connectors
Running singularity containers to access cards connected to the 8 hosts
Running static configuration for each host (GPUs do not change on the fly)

Host-side command of “nvidia-smi” reports the GPU as expected:
n85 ~]# nvidia-smi
Thu Mar 16 13:34:35 2023
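+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB           Off| 00000000:64:00.0 Off |                    0 |
| N/A   31C    P0               35W / 250W|      0MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+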
But inside containers, I get the dreaded “No devices were found” error:
n85 ~]# singularity run --nv tensorflow.sif nvidia-smi
No devices were found

When this error occurs, a corresponding error is printed in dmesg:
[ 235.480540] NVRM: GPU 0000:64:00.0: RmInitAdapter failed! (0x61:0x0:1542)
[ 235.480593] NVRM: GPU 0000:64:00.0: rm_init_adapter failed, device minor number 0
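To track how often the adapter fails to initialize, the kernel log lines above can be reduced to the PCI address and the status triple. This is a minimal sketch: the function name nvrm_init_failures is made up for illustration, and the pattern assumes only the message format shown in the dmesg output above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<pci-address> <status>" for every "RmInitAdapter failed!" line.
nvrm_init_failures() {
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed! (\([^)]*\)).*/\1 \2/p'
}

# On a real host the input would come from the kernel log, for example:
#   dmesg | nvrm_init_failures | sort | uniq -c
```

Piping the result through sort and uniq -c gives a per-GPU count, which makes it easier to compare hosts that were deployed identically.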

Oddly, I have also seen this work at times, so it is not consistent (one run may work, while a second run right after it will fail).
I have deployed all of these hosts with automation, so I am fairly confident their configurations have been kept similar.
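Because the failure is intermittent, it can help to put a number on it before and after each change. The sketch below repeats a command and counts the runs that either exit non-zero or print the "No devices were found" message. The function name count_failures and the choice of 20 runs in the usage comment are illustrative assumptions, not anything taken from the thread.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and report how many runs failed.
# A run counts as failed if it exits non-zero or prints "No devices were found".
count_failures() (
    runs=$1
    shift
    failed=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if out=$("$@" 2>&1) && ! printf '%s\n' "$out" | grep -q 'No devices were found'; then
            :   # this run looked healthy
        else
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed/$runs runs failed"
)

# Example invocation on an affected host (not executed here):
#   count_failures 20 singularity run --nv tensorflow.sif nvidia-smi
```

Running the same loop on a known-good host gives a baseline to compare against.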
Things I have attempted so far:
Reconfigured with a new A100 GPU that worked reliably in another host: same result
Updated the graphics drivers to the latest version: same result
Updated Singularity (now Apptainer): same result
Tested newer containers with GPU support: same result
Installed strace in the tensorflow container:
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0), ...}) = 0
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffcbf567bc4) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffcbf565870) = 0
getpid() = 9616
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "No devices were found\n", 22No devices were found
) = 22

I am looking to add to my toolkit for testing GPUs. Is anyone able to point me in a direction that can help debug this further?

Does disabling the GSP firmware work around the issue?

Looking at the output from /proc/driver/nvidia/gpus/*/information, it looks like it is disabled:
cat /proc/driver/nvidia/gpus/0000:64:00.0/information
Model: NVIDIA A100-PCIE-40GB
IRQ: 29
GPU UUID: GPU-3ebdfb5d-d358-0c96-1693-b0cd44892912
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:64:00.0
Device Minor: 0
GPU Firmware: N/A
GPU Excluded: No

But it looks like it is enabled from the output of nvidia-smi:
nvidia-smi -q | grep -i gsp
GSP Firmware Version : 530.30.02

I ran modprobe -r nvidia ; modprobe nvidia NVreg_EnableGpuFirmware=0, but the output of nvidia-smi and /proc/driver/nvidia/gpus/0000:64:00.0/information leaves me unsure whether it is actually disabled. Thoughts?

Also, if it is disabled, the workaround unfortunately doesn't appear to have worked.
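One way to check whether a module option reached the loaded driver is to read back the value the driver itself reports. The sketch below assumes the driver exposes its registry keys as "Key: value" lines in /proc/driver/nvidia/params; the function name nvreg_value is hypothetical, and the file path can be overridden for testing.

```shell
#!/bin/sh
# Hypothetical helper: print the value reported for one registry key.
# Assumes a "Key: value" file such as /proc/driver/nvidia/params.
nvreg_value() (
    key=$1
    file=${2:-/proc/driver/nvidia/params}
    sed -n "s/^${key}: *//p" "$file"
)

# Example on a host with the driver loaded (not executed here):
#   nvreg_value EnableGpuFirmware
# It is also worth confirming that the unload really happened, since
# "modprobe -r nvidia" fails while other modules still depend on it:
#   lsmod | grep '^nvidia'
```

If the unload step fails, the following modprobe is a no-op and the new option is never applied, which would be consistent with a GSP version still being reported.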

If Video BIOS: ??.??.??.??.?? is shown, the GPU is not initialized and all of the info is bogus. nvidia-smi is the reliable source, since it initializes the GPU in order to read any info. So if nvidia-smi showed a GSP version, the module unload/reload didn't have the desired effect.
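The rule of thumb above can be turned into a small guard for scripts that read the /proc information file. This is only a sketch of that heuristic: the function name gpu_info_initialized is hypothetical, and it checks nothing beyond the placeholder Video BIOS string.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the information file does NOT show
# the placeholder Video BIOS string, i.e. the GPU appears to be initialized.
gpu_info_initialized() {
    # grep -F matches the question marks and dots literally.
    if grep '^Video BIOS:' "$1" | grep -Fq '??.??.??.??.??'; then
        return 1
    fi
    return 0
}

# Example on a real host (not executed here):
#   f=/proc/driver/nvidia/gpus/0000:64:00.0/information
#   gpu_info_initialized "$f" || echo "GPU not initialized; do not trust $f"
```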

OK, that makes sense. Looking at modprobe -v, it appears that the option is being set (I put this config into /etc/modprobe.d/nvidia.conf):
modprobe -v nvidia
insmod /lib/modules/3.10.0-1160.25.1.el7.x86_64/extra/nvidia.ko.xz NVreg_EnableGpuFirmware=0
insmod /lib/modules/3.10.0-1160.25.1.el7.x86_64/extra/nvidia-uvm.ko.xz

But I am still not seeing N/A in the output of nvidia-smi -q:
nvidia-smi -q | grep -i gsp
GSP Firmware Version : 530.30.02

Any other way of attempting to disable GSP?

Try setting it as a kernel parameter and reboot. I suspect firmware unloading is not supported.
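After the reboot, it is worth confirming that the parameter actually reached the running kernel. The sketch below checks the kernel command line for an exact argument. It assumes the usual module.parameter=value form (here nvidia.NVreg_EnableGpuFirmware=0); the function name has_kernel_arg is hypothetical, and the grubby command in the comment is one possible way to add the argument on an EL7 host.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact argument appears on the kernel
# command line. The file defaults to /proc/cmdline and can be overridden.
has_kernel_arg() (
    arg=$1
    file=${2:-/proc/cmdline}
    # One argument per line, then an exact fixed-string match on a whole line.
    tr ' ' '\n' < "$file" | grep -qxF -- "$arg"
)

# Possible way to add the argument on an EL7 host (assumption, run as root):
#   grubby --update-kernel=ALL --args="nvidia.NVreg_EnableGpuFirmware=0"
# After rebooting:
#   has_kernel_arg nvidia.NVreg_EnableGpuFirmware=0 && echo "parameter present"
```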

Same result with the kernel parameter, I am afraid.