System:
LIQID composable stack with 24x A100 GPUs
8x hosts connected to the LIQID fabric via PCIe
Singularity containers running on the hosts to access the composed GPUs
Static configuration on each host (GPUs are not recomposed on the fly)
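For context, each host sees its composed GPU as an ordinary PCIe device, so a quick host-side sanity check (assuming standard pciutils) is just:

n85 ~]# lspci -d 10de: -vv | grep -E '^[0-9a-f]|LnkSta:'

which lists the NVIDIA device and its link status; LnkSta should show the expected speed/width across the LIQID fabric.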
Problem:
On the host side, nvidia-smi reports the GPU as expected:
n85 ~]# nvidia-smi
Thu Mar 16 13:34:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:64:00.0 Off |                    0 |
| N/A   31C    P0               35W / 250W|      0MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
But in containers, I get the dreaded "No devices were found" error:
n85 ~]# singularity run --nv tensorflow.sif nvidia-smi
No devices were found
When this error occurs, a corresponding error is printed in dmesg:
[ 235.480540] NVRM: GPU 0000:64:00.0: RmInitAdapter failed! (0x61:0x0:1542)
[ 235.480593] NVRM: GPU 0000:64:00.0: rm_init_adapter failed, device minor number 0
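For what it's worth, RmInitAdapter fires when the driver (re)initializes the GPU on open, and without persistence mode that happens on every first open after the last client exits, so an init race seems at least plausible. Ruling that out would look like this on the host (a sketch, assuming root and a stock driver install):

n85 ~]# nvidia-smi -pm 1
(or, if the daemon package is present: systemctl enable --now nvidia-persistenced)

Persistence mode keeps the GPU initialized between opens instead of tearing it down after each client exits.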
Oddly, I have also seen this work at times, so it is not consistent: one run may succeed while a second run immediately after fails.
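To put a number on the flakiness, a back-to-back loop with the same tensorflow.sif shows the mix of passes and failures:

for i in $(seq 1 20); do
  # nvidia-smi exits nonzero when no devices are found
  if singularity run --nv tensorflow.sif nvidia-smi >/dev/null 2>&1; then
    echo "run $i: OK"
  else
    echo "run $i: FAILED"
  fi
done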
I have deployed all of these hosts with automation, so I am fairly confident their configurations are identical.
Things I have attempted so far:
Reconfigured with a new A100 GPU that worked reliably in another host: same result
Updated to the latest NVIDIA drivers: same result
Updated Singularity (now Apptainer): same result
Tested newer containers with GPU support: same result
Installed strace in the TensorFlow container and traced nvidia-smi; the relevant calls:
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0), ...}) = 0
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffcbf567bc4) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffcbf565870) = 0
getpid() = 9616
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "No devices were found\n", 22No devices were found
) = 22
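That trace shows it is the bare open(2) of /dev/nvidia0 that returns EIO (the driver requires a read-write open), so the failure can be reproduced with no CUDA stack at all. Bash's <> redirection opens a file O_RDWR:

n85 ~]# singularity exec --nv tensorflow.sif bash -c 'exec 3<>/dev/nvidia0 && echo "open OK"'

When the bug hits, bash reports the Input/output error on the open itself; otherwise it prints "open OK".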
My toolkit for testing GPUs is something I am looking to expand. Can anyone point me in a direction that would help debug this further?
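One addition I am sketching (the script name and layout are mine, so treat it as a rough draft): a wrapper that pairs each container run with whatever NVRM lines hit the kernel log during that run, so a failure arrives with its dmesg context attached:

#!/bin/bash
# run-gpu-check.sh (hypothetical): run the container once, then print any
# NVRM messages that appeared in the kernel log during the run.
before=$(dmesg | wc -l)
singularity run --nv tensorflow.sif nvidia-smi
status=$?
dmesg | tail -n +"$((before + 1))" | grep NVRM
exit "$status"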