Uncorrected ECC - how to get back on track

I have an 'Uncorr. ECC' problem on my K80 that is apparently preventing its use:

root@x:~# nvidia-smi
Sun Feb 12 11:00:53 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   66C    P0   105W / 149W |  10819MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   145W / 149W |  10819MiB / 11439MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   43C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   54C    P0   151W / 149W |  10819MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14252    C   python solve_jr.py 0                         10815MiB |
|    1      9625    C   python solve_jr.py 1                         10815MiB |
|    3     15555    C   python solve_jr.py 3                         10815MiB |
+-----------------------------------------------------------------------------+

Apparently some sort of error has occurred on GPU #2. Should I turn off ECC, reboot the machine, or what?
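
Before doing anything drastic, the per-GPU ECC detail can be dumped with the standard nvidia-smi queries; just a sketch, and the exact output layout will vary by driver version:

# Full ECC section for GPU 2: single- vs. double-bit counts, volatile vs. aggregate
nvidia-smi -q -d ECC -i 2

# Compact CSV view across all GPUs
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.volatile.total --format=csv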

Trying a GPU reset with

nvidia-smi -r -i 2

asked for all GPU processes (even those running on other GPUs) to be killed, which was what I was trying to avoid… but even killing all the processes doesn't allow a reset:

root@x:~# nvidia-smi
Sun Feb 12 11:37:31 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   43C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   46C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   39C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@x:~# nvidia-smi -r -i 2
Unable to reset this GPU because it's being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.
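
In hindsight, a couple of things might have been worth trying before giving up on the reset. This is untested on this exact box, and I'm assuming the K80's two GPUs per board (here presumably 0000:87:00.0 and 0000:88:00.0) both need to be idle:

# See whether anything still holds the NVIDIA device nodes open (X, monitoring daemons, nvidia-persistenced, ...)
fuser -v /dev/nvidia*

# If nothing shows up, retry the reset for GPU 2
nvidia-smi -r -i 2

# The volatile ECC error counter can also be cleared explicitly (0 = volatile, 1 = aggregate),
# though I don't know whether that alone gets the driver back to a usable state after a double-bit error
nvidia-smi -p 0 -i 2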

A machine reboot got the GPU back, at the cost of a day's computation.

Has this ever happened to your rig before? If so, what were the circumstances?

What do you suspect was the cause this time?

Are you employing a power conditioner?

Does the motherboard you are using offer the option of scrubbing all of its ECC RAM every 8 hours (my old Asus M5A88-V EVO does via its ‘Super’ setting)? If so, are you taking advantage of that feature or does it impair the performance of running computational tasks?
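
Not sure whether your board exposes it to the OS, but on Linux the kernel's EDAC interface is one way to see host ECC error counts and the configured scrub rate. This assumes the edac-utils package and a supported memory-controller driver, which a cloud VM may well not provide:

# Corrected / uncorrected error counts per memory controller, if an EDAC driver is loaded
edac-util -v

# Raw counters and the configured scrub rate (bytes per second) via sysfs
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
cat /sys/devices/system/edac/mc/mc*/sdram_scrub_rate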

Just so I can learn more on my own time, what are the makes & model #s of the CPU, RAM and motherboard you are using? I’d like to know more about how pro-grade gear interacts with ECC RAM.

BTW, this is what woke me up to the value of using such RAM:

The following lecture is by Artem Dinaburg, a guy who works for Raytheon Company, a major U.S. ‘defense’ contractor:

"It turns out that non-ECC RAM is actually a security risk, as bit flips can be exploited. “Bit-squatting” from Black Hat 2011:

Mar 15, 2013
Blackhat 2011 - Bit-squatting: DNS Hijacking without exploitation - YouTube
http://www.youtube.com/watch?v=_si0FYl_IOA

Bitsquatting: DNS Hijacking without exploitation
http://dinaburg.org/bitsquatting.html

“…As the graph above shows, ECC RAM has a much lower failure rate than non-ECC RAM. The ~1% failure rate of the Kingston non-ECC RAM is still very, very good (which is why we primarily use Kingston), but the ECC RAM is even better at an average .24% failure rate…”

November 5, 2013
Advantages of ECC Memory - Puget Custom Computers
http://www.pugetsystems.com/labs/articles/Advantages-of-ECC-Memory-520/

Related:

May 13, 2014
ECC and REG ECC Memory Performance - Puget Custom Computers
https://www.pugetsystems.com/labs/articles/ECC-and-REG-ECC-Memory-Performance-560/

This was a 'new' server (an Azure cloud server that I started using Feb. 1).

I ran across this interesting paper, https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf, while looking into whether I can safely turn off ECC.
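
In case I do go that route, the switch itself looks like a one-liner per the nvidia-smi options (needs root, and the pending mode only takes effect after a reboot); I'm not yet convinced disabling it is wise:

# Disable ECC on GPU 2 (0 = disabled, 1 = enabled); becomes active after the next reboot
nvidia-smi -e 0 -i 2

# Check current vs. pending ECC mode to confirm the change is queued
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv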

As for the other info you were after, it's a GPU I'm talking about, not a CPU.

Anyway, thanks for the interest.

Thank you for the clarification and the "DRAM Errors in the Wild: A Large-Scale Field Study" link.