I have an uncorrectable ECC error ("Volatile Uncorr. ECC" in nvidia-smi) on one of my Tesla K80 GPUs that is apparently preventing its use:
root@x:~# nvidia-smi
Sun Feb 12 11:00:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   66C    P0   105W / 149W |  10819MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   145W / 149W |  10819MiB / 11439MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   43C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   54C    P0   151W / 149W |  10819MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14252    C   python solve_jr.py 0                         10815MiB |
|    1      9625    C   python solve_jr.py 1                         10815MiB |
|    3     15555    C   python solve_jr.py 3                         10815MiB |
+-----------------------------------------------------------------------------+
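Apparently some sort of error has occurred on GPU #2, which reports 2 volatile uncorrectable ECC errors and is sitting idle while the other three are busy. Should I turn off ECC, reboot the machine, or do something else?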
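
In case it helps with diagnosing this, here is a minimal sketch of how the per-GPU counters could be checked from a script rather than read off the table. It assumes the counts come from `nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv,noheader`. The function name `gpus_with_uncorrected_ecc` and the sample string are hypothetical: the sample is made up to mirror the table above, not captured from the machine.

```python
import csv
import io

# Assumed command that would produce the CSV text parsed below.
# It is not executed here, so the sketch runs on a machine without a GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]


def gpus_with_uncorrected_ecc(csv_text):
    """Return {gpu_index: count} for GPUs with a nonzero volatile uncorrected ECC count."""
    flagged = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        index, count = (field.strip() for field in row)
        # Non-numeric fields (for example when ECC is unsupported or disabled) are skipped.
        if not (index.isdigit() and count.isdigit()):
            continue
        if int(count) > 0:
            flagged[int(index)] = int(count)
    return flagged


# Hypothetical sample mirroring the table above: only GPU 2 reports errors.
sample = "0, 0\n1, 0\n2, 2\n3, 0\n"
print(gpus_with_uncorrected_ecc(sample))
```

To run it against live data, the text could be obtained with `subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout` and passed to the same function.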