Every time I run nvidia-smi to check on the system status, I get different output:
-sh-4.1$ nvidia-smi
Failed to initialize NVML: Unknown Error
-sh-4.1$ nvidia-smi
Unable to determine the device handle for GPU 0000:04:00.0: The NVIDIA kernel
module detected an issue with GPU interrupts. Consult the "Common Problems"
Chapter of the NVIDIA Driver README for details and steps that can be taken
to resolve this issue.
-sh-4.1$ nvidia-smi
Tue Jan 12 16:03:49 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   32C    P0    49W / 225W |     12MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:84:00.0     Off |                    0 |
| 30%   35C    P0    53W / 225W |     12MiB /  4799MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
A couple of questions:
- How do I diagnose this inconsistent output?
- When nvidia-smi does detect my GPUs, why is the volatile GPU-Util 95% on the second GPU even though there are no running processes? This always happens on the second GPU.
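
For reference, this is roughly how I plan to keep polling the GPUs and logging the output while I wait for suggestions. It is only a sketch: the query fields are standard nvidia-smi options, but the log file paths are placeholders I made up.

# Poll the GPUs once a minute and keep a tail of the kernel log alongside,
# so a failed nvidia-smi run can be lined up with dmesg messages from the
# same time.
while true; do
    date >> /tmp/nvidia-smi.log
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
               --format=csv >> /tmp/nvidia-smi.log 2>&1
    dmesg | tail -n 20 >> /tmp/dmesg-tail.log
    sleep 60
done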