Intermittent "No devices were found" on CentOS 7

I have 3 V100s in the system. When I run nvidia-smi, it sometimes reports “No devices were found”.

Other times it shows all three GPUs fine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   49C    P0    40W / 250W |      0MiB / 32510MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is causing problems because scripts that depend on the GPUs crash intermittently. Any suggestions or ideas on how to fix this?
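For now I could at least make the scripts bail out early instead of crashing mid-run; a minimal guard (just a sketch) would be something like:

if ! nvidia-smi -L 2>/dev/null | grep -q "^GPU"; then
    echo "No GPUs visible to the driver, aborting" >&2
    exit 1
fi

But I'd much rather fix the root cause.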

Thanks

Is this something that has only just started occurring?

If not, is the system running an X server and/or nvidia-persistenced?
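In case it helps, a quick way to check both (the systemd unit name may vary depending on how the driver was packaged):

pgrep -a Xorg                            # is an X server running?
systemctl status nvidia-persistenced     # is the persistence daemon managed by systemd?
ps aux | grep [n]vidia-persistenced      # fallback check if it isn't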

Yes, this started happening recently.

The system is running an X server. I don’t see nvidia-persistenced.

I also see this in dmesg | grep -i NVRM:

[1143867.699655] NVRM: GPU 0000:1b:00.0: Failed to copy vbios to system memory.
[1143867.699998] NVRM: GPU 0000:1b:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143867.700028] NVRM: GPU 0000:1b:00.0: rm_init_adapter failed, device minor number 0
[1143869.468771] NVRM: GPU 0000:1e:00.0: Failed to copy vbios to system memory.
[1143869.469158] NVRM: GPU 0000:1e:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143869.469190] NVRM: GPU 0000:1e:00.0: rm_init_adapter failed, device minor number 1
[1143872.322644] NVRM: GPU 0000:b5:00.0: Failed to copy vbios to system memory.
[1143872.323024] NVRM: GPU 0000:b5:00.0: RmInitAdapter failed! (0x30:0xffff:794)
[1143872.323101] NVRM: GPU 0000:b5:00.0: rm_init_adapter failed, device minor number 2

Sorry, I’m out of ideas then. I’d have thought that dmesg output indicates the cards are gone until at least a reboot, rather than nvidia-smi working intermittently. Or do you have to reboot to get things back?

Yes, it seems like the cards are completely gone after this message. It’s been some time and nvidia-smi still consistently reports no devices.

It’s weird because nvidia-smi did behave intermittently before that dmesg output appeared. I wonder whether it’s a software or a hardware problem.

I’d suspect the power supply. Do the logs show any “GPU has fallen off the bus” entries?
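A rough way to search for that (and for any Xid errors, which NVRM also logs) would be something like:

dmesg -T | grep -i "fallen off the bus"
dmesg -T | grep -i "NVRM: Xid"
grep -i "fallen off the bus" /var/log/messages   # CentOS 7 also keeps kernel messages here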

Unbelievable. No reboot, nothing, and nvidia-smi reports them again. I don’t see any “GPU has fallen off the bus” messages in dmesg.

That is odd. “GPU has fallen off the bus” is usually a good indicator of a sagging PSU.

Apart from reinstalling the driver, I’m out of ideas. Having said that, have a read of the driver persistence page and maybe get nvidia-persistenced running, although it’s unlikely to be the cause given you’ve not had any trouble until now.

https://docs.nvidia.com/deploy/driver-persistence/index.html
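A minimal sketch of getting persistence going, assuming your driver install provided the usual nvidia-persistenced systemd unit (names can differ between install methods):

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# or the legacy route, setting persistence mode directly:
sudo nvidia-smi -pm 1

The daemon is what that page recommends; nvidia-smi -pm is described there as the legacy mechanism.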

Thanks, I’ll give it a try.