I have three V100s in this system. When I run nvidia-smi, it sometimes reports "No devices were found"; other times it shows all three GPUs fine. When it works, the output looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   47C    P0    38W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   49C    P0    40W / 250W |      0MiB / 32510MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
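To get a sense of how often it fails, I've been polling nvidia-smi from a small script, roughly like this (just a sketch; the log path and the 10-second interval are arbitrary values I picked):

```python
# Minimal repro sketch: poll nvidia-smi and timestamp every failure.
# LOG_PATH and the 10 s interval are placeholders, not anything official.
import subprocess
import time
from datetime import datetime

LOG_PATH = "/tmp/nvidia_smi_poll.log"

while True:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # nvidia-smi exits non-zero (and prints "No devices were found")
    # when the driver can't see any GPU.
    if result.returncode != 0:
        with open(LOG_PATH, "a") as f:
            f.write(f"{datetime.now().isoformat()} FAIL rc={result.returncode}: "
                    f"{(result.stdout + result.stderr).strip()}\n")
    time.sleep(10)
```

The failures show up in bursts with no obvious pattern I can correlate to load.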
This is causing real problems, because scripts that depend on the GPUs crash intermittently whenever the devices disappear. As a stopgap I've been gating them on a device check (sketch below), but I'd much rather fix the root cause. Any suggestions or ideas on how to fix this?
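The check is roughly this (it uses the pynvml bindings; EXPECTED_GPUS and the retry/delay values are placeholders I picked, nothing official):

```python
# Stopgap guard at the top of each GPU script (sketch only; EXPECTED_GPUS
# and the retry/delay numbers are assumptions, tune to taste).
import time

import pynvml

EXPECTED_GPUS = 3

def wait_for_gpus(retries=5, delay=30):
    """Return True once NVML reports all expected GPUs, False after giving up."""
    for attempt in range(1, retries + 1):
        try:
            pynvml.nvmlInit()
            count = pynvml.nvmlDeviceGetCount()
            pynvml.nvmlShutdown()
            if count >= EXPECTED_GPUS:
                return True
            print(f"attempt {attempt}: only {count} of {EXPECTED_GPUS} GPUs visible")
        except pynvml.NVMLError as err:
            print(f"attempt {attempt}: NVML error: {err}")
        time.sleep(delay)
    return False

if not wait_for_gpus():
    raise SystemExit("GPUs still missing; aborting before the real work starts")
```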
Thanks