Intermittent "No devices were found" on CentOS 7

I have 3 V100s in the system. When I run nvidia-smi, it sometimes reports “No devices were found”.

Other times it shows all three GPUs fine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   49C    P0    40W / 250W |      0MiB / 32510MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is causing problems because scripts that depend on the GPUs crash intermittently. Any suggestions or ideas on how to fix this?
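For now I could at least make the scripts bail out early instead of crashing mid-run; a minimal guard (just a sketch) would be something like:

if ! nvidia-smi -L 2>/dev/null | grep -q "^GPU"; then
    echo "No GPUs visible to the driver, aborting" >&2
    exit 1
fi

But I'd much rather fix the root cause.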

Thanks

Is this something that has only just started occurring?

If not, is the system running an X server and/or nvidia-persistenced?
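In case it helps, a quick way to check both (the systemd unit name may vary depending on how the driver was packaged):

pgrep -a Xorg                            # is an X server running?
systemctl status nvidia-persistenced     # is the persistence daemon managed by systemd?
ps aux | grep [n]vidia-persistenced      # fallback check if it isn't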

Yes, this started happening recently.

The system is running an X server. I don’t see nvidia-persistenced.

I also see this in dmesg | grep -i NVRM:

[1143867.699655] NVRM: GPU 0000:1b:00.0: Failed to copy vbios to system memory.
[1143867.699998] NVRM: GPU 0000:1b:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143867.700028] NVRM: GPU 0000:1b:00.0: rm_init_adapter failed, device minor number 0
[1143869.468771] NVRM: GPU 0000:1e:00.0: Failed to copy vbios to system memory.
[1143869.469158] NVRM: GPU 0000:1e:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143869.469190] NVRM: GPU 0000:1e:00.0: rm_init_adapter failed, device minor number 1
[1143872.322644] NVRM: GPU 0000:b5:00.0: Failed to copy vbios to system memory.
[1143872.323024] NVRM: GPU 0000:b5:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143872.323101] NVRM: GPU 0000:b5:00.0: rm_init_adapter failed, device minor number 2

Sorry, I’m out of ideas then. I’d have thought that dmesg output indicates the cards are gone until at least a reboot, rather than nvidia-smi working intermittently. Or do you have to reboot to get things back?

Yes, it seems like the cards are completely gone after this message. It’s been some time and nvidia-smi still consistently reports no devices.

It’s weird because nvidia-smi did behave intermittently before that dmesg output appeared. I wonder whether it’s a software or a hardware problem.

I’d suspect the power supply. Do the logs show any “GPU has fallen off the bus” entries?
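A rough way to search for that (and for any Xid errors, which NVRM also logs) would be something like:

dmesg -T | grep -i "fallen off the bus"
dmesg -T | grep -i "NVRM: Xid"
grep -i "fallen off the bus" /var/log/messages   # CentOS 7 also keeps kernel messages here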

Unbelievable. No reboot, nothing, and nvidia-smi reports them again. I don’t see any “GPU has fallen off the bus” messages in dmesg.

That is odd. “GPU has fallen off the bus” is usually a good indicator of a sagging PSU.

Apart from reinstalling the driver, I’m out of ideas. Having said that, have a read of the driver persistence page and maybe get nvidia-persistenced running, although it’s unlikely to be the cause given you’ve not had any trouble until now.

https://docs.nvidia.com/deploy/driver-persistence/index.html
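A minimal sketch of getting persistence going, assuming your driver install provided the usual nvidia-persistenced systemd unit (names can differ between install methods):

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# or the legacy route, setting persistence mode directly:
sudo nvidia-smi -pm 1

The daemon is what that page recommends; nvidia-smi -pm is described there as the legacy mechanism.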

Thanks, I’ll give it a try.