Hi! Sometimes nvidia-smi shows “ERR!” in the “Fan” column, like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  P106-100            On   | 00000000:02:00.0 Off |                  N/A |
|ERR!   40C    P0    59W /  60W |   3023MiB /  6080MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
I assume this is related to setting the fan speed to 100%. Previously I used driver version 396.45, and with that version the fan percentage was displayed exactly as I had set it (“100%”). With the new version, if I set the fan speed to “99%” with nvidia-settings, nvidia-smi shows “98%”, “99%”, or “100%” at different times, which seems to correlate with the fan tachometer. If I block the fan by hand, it shows “0%”. I also still see “ERR!” sometimes, but rarely. There are no “ERR!” readings if I set the speed to, for example, 60%. My assumption is that “ERR!” appears when the fan speed comes out as something like “101%”.
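For anyone trying to reproduce this, here is a minimal sketch of how the fan speed might be set from a script. The exact invocation used above is not stated, so the attribute names (GPUFanControlState, GPUTargetFanSpeed) and the gpu/fan indices are assumptions, and fan_speed_command is a hypothetical helper that only builds the argument list.

```python
def fan_speed_command(percent, gpu=0, fan=0):
    """Return an nvidia-settings argv that requests a fixed fan speed.

    Assumption: the GPU exposes manual fan control through the
    GPUFanControlState and GPUTargetFanSpeed attributes.
    """
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be between 0 and 100 percent")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",  # switch to manual control
        "-a", f"[fan:{fan}]/GPUTargetFanSpeed={percent}",  # requested percent
    ]
```

The list can be passed to subprocess.run(fan_speed_command(99), check=True). This typically needs a running X server and the Coolbits option enabled in the X configuration.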
At the moments when “ERR!” is shown, the NVML function nvmlDeviceGetFanSpeed returns NVML_ERROR_UNKNOWN (999).
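The NVML behaviour can be checked with a small polling loop. This is only a sketch: it assumes the pynvml bindings are installed, and poll_fan_speed and make_nvml_reader are hypothetical helper names, not part of NVML. The reader is passed in as a callable so the loop itself can be tried without a GPU.

```python
import time

NVML_ERROR_UNKNOWN = 999  # error code reported above for the "ERR!" moments


def poll_fan_speed(read_fan, samples=10, interval=0.0):
    """Call read_fan() repeatedly and return (readings, error_codes).

    read_fan is any zero-argument callable that returns the fan speed in
    percent, or raises an exception carrying a numeric `value` attribute
    (as pynvml.NVMLError does).
    """
    readings, errors = [], []
    for _ in range(samples):
        try:
            readings.append(read_fan())
        except Exception as err:  # broad on purpose, this is only a sketch
            errors.append(getattr(err, "value", None))
        time.sleep(interval)
    return readings, errors


def make_nvml_reader(index=0):
    """Build a reader backed by NVML. Needs the pynvml package and a GPU."""
    import pynvml  # imported lazily so poll_fan_speed works without it

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    return lambda: pynvml.nvmlDeviceGetFanSpeed(handle)
```

On an affected machine, something like readings, errors = poll_fan_speed(make_nvml_reader(0), samples=60, interval=1.0) should collect the percentages in readings and, if the behaviour described above occurs, 999 entries in errors.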
My OS is CoreOS v1506.0.0 (alpha) with Linux kernel v4.12.7.
Does anyone know how to avoid “ERR!” with the fan set to “100%”? Or does anyone know the nearest driver version for CUDA 10.0 (>= 410.48) that does not have this problem?