Uncorrected ECC - how to get back on track

I have an 'Uncorr. ECC' problem on my K80 that is apparently preventing its use:

root@x:~# nvidia-smi
Sun Feb 12 11:00:53 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   66C    P0   105W / 149W |  10819MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   145W / 149W |  10819MiB / 11439MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   43C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   54C    P0   151W / 149W |  10819MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14252    C   python solve_jr.py 0                         10815MiB |
|    1      9625    C   python solve_jr.py 1                         10815MiB |
|    3     15555    C   python solve_jr.py 3                         10815MiB |
+-----------------------------------------------------------------------------+

Apparently some sort of error has occurred on GPU #2. Should I turn off ECC, reboot the machine, or what?
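
Before doing anything drastic, the per-GPU ECC detail can be dumped with the standard nvidia-smi queries; just a sketch, and the exact output layout will vary by driver version:

# Full ECC section for GPU 2: single- vs. double-bit counts, volatile vs. aggregate
nvidia-smi -q -d ECC -i 2

# Compact CSV view across all GPUs
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.volatile.total --format=csv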

Trying a GPU reset with

nvidia-smi -r -i 2

asked for all GPU processes (even those running on other GPUs) to be killed, which was what I was trying to avoid… but even killing all the processes doesn't allow a reset:

root@x:~# nvidia-smi
Sun Feb 12 11:37:31 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   43C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   46C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   39C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@x:~# nvidia-smi -r -i 2
Unable to reset this GPU because it's being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.
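
In hindsight, a couple of things might have been worth trying before giving up on the reset. This is untested on this exact box, and I'm assuming the K80's two GPUs per board (here presumably 0000:87:00.0 and 0000:88:00.0) both need to be idle:

# See whether anything still holds the NVIDIA device nodes open (X, monitoring daemons, nvidia-persistenced, ...)
fuser -v /dev/nvidia*

# If nothing shows up, retry the reset for GPU 2
nvidia-smi -r -i 2

# The volatile ECC error counter can also be cleared explicitly (0 = volatile, 1 = aggregate),
# though I don't know whether that alone gets the driver back to a usable state after a double-bit error
nvidia-smi -p 0 -i 2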

A machine reboot got the GPU back, at the cost of a day's computation.

Has this ever happened to your rig before? If so, what were the circumstances?

What do you suspect was the cause this time?

Are you employing a power conditioner?

Does the motherboard you are using offer the option of scrubbing all of its ECC RAM every 8 hours (my old Asus M5A88-V EVO does via its ‘Super’ setting)? If so, are you taking advantage of that feature or does it impair the performance of running computational tasks?
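
Not sure whether your board exposes it to the OS, but on Linux the kernel's EDAC interface is one way to see host ECC error counts and the configured scrub rate. This assumes the edac-utils package and a supported memory-controller driver, which a cloud VM may well not provide:

# Corrected / uncorrected error counts per memory controller, if an EDAC driver is loaded
edac-util -v

# Raw counters and the configured scrub rate (bytes per second) via sysfs
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
cat /sys/devices/system/edac/mc/mc*/sdram_scrub_rate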

Just so I can learn more on my own time, what are the makes & model #s of the CPU, RAM and motherboard you are using? I’d like to know more about how pro-grade gear interacts with ECC RAM.

BTW, this is what woke me up to the value of using such RAM:

The following lecture is by Artem Dinaburg, a guy who works for Raytheon Company, a major U.S. ‘defense’ contractor:

"It turns out that non-ECC RAM is actually a security risk, as bit flips can be exploited. “Bit-squatting” from Black Hat 2011:

Mar 15, 2013
Blackhat 2011 - Bit-squatting: DNS Hijacking without exploitation - YouTube
http://www.youtube.com/watch?v=_si0FYl_IOA

Bitsquatting: DNS Hijacking without exploitation
http://dinaburg.org/bitsquatting.html

“…As the graph above shows, ECC RAM has a much lower failure rate than non-ECC RAM. The ~1% failure rate of the Kingston non-ECC RAM is still very, very good (which is why we primarily use Kingston), but the ECC RAM is even better at an average .24% failure rate…”

November 5, 2013
Advantages of ECC Memory - Puget Custom Computers
http://www.pugetsystems.com/labs/articles/Advantages-of-ECC-Memory-520/

Related:

May 13, 2014
ECC and REG ECC Memory Performance - Puget Custom Computers
https://www.pugetsystems.com/labs/articles/ECC-and-REG-ECC-Memory-Performance-560/

This was a 'new' server (an Azure cloud server that I started using Feb. 1).

I ran across this interesting paper, https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf, while looking into whether I can safely turn off ECC.
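
In case I do go that route, the switch itself looks like a one-liner per the nvidia-smi options (needs root, and the pending mode only takes effect after a reboot); I'm not yet convinced disabling it is wise:

# Disable ECC on GPU 2 (0 = disabled, 1 = enabled); becomes active after the next reboot
nvidia-smi -e 0 -i 2

# Check current vs. pending ECC mode to confirm the change is queued
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv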

As for the other info you were after, it's a GPU I'm talking about, not a CPU.

Anyway, thanks for the interest.

Thank you for the clarification and the "DRAM Errors in the Wild: A Large-Scale Field Study" link.