Computing the probability of ECC errors on a GTX GPU

CudaaduC · January 11, 2016, 4:46am

Have to justify the use of the Tesla GPU line for a client, and for my use case the main compelling argument is avoiding ECC errors.

Have read through Google results such as this:

https://en.wikipedia.org/wiki/ECC_memory

and seen the counterpoint argument (for the simulation case) made by Amber:

http://ambermd.org/gpus/#Accuracy

Assuming that a GTX Titan X will not be running 24/7, but rather in short batches 5 days a week, how likely are ECC errors to cause a serious issue?

Even if this statement is true:

about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate

the probability that the specific memory affected by the errors is used for a result is quite small.
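As a rough sketch of that arithmetic: the snippet below treats bit errors as a Poisson process, which is a modeling assumption and not something the quoted source states. It uses the quoted top-end rate and the 12 GB of a GTX Titan X. The batch length and the fraction of memory holding data that actually reaches the result are hypothetical placeholders, to be replaced with measured values.

```python
import math

# Rate taken from the quoted statement: 5 single-bit errors per 8 GB per hour.
QUOTED_RATE_PER_GB_HOUR = 5.0 / 8.0


def expected_bit_errors(rate_per_gb_hour, memory_gb, hours):
    """Expected number of single-bit errors, assuming the rate scales
    linearly with memory size and run time."""
    return rate_per_gb_hour * memory_gb * hours


def prob_at_least_one(expected_errors):
    """P(at least one error), assuming errors follow a Poisson process."""
    return 1.0 - math.exp(-expected_errors)


if __name__ == "__main__":
    memory_gb = 12.0       # GTX Titan X memory size
    batch_hours = 2.0      # hypothetical length of one batch run
    live_fraction = 0.25   # hypothetical share of memory that feeds the result

    total = expected_bit_errors(QUOTED_RATE_PER_GB_HOUR, memory_gb, batch_hours)
    harmful = total * live_fraction  # errors landing in memory that matters

    print(f"expected errors per batch:        {total:.2f}")
    print(f"expected errors in live data:     {harmful:.2f}")
    print(f"P(at least one hits live data):   {prob_at_least_one(harmful):.3f}")
```

Plugging in a lower, observed error rate in place of the quoted top-end figure changes the outcome far more than adjusting the live fraction does, so the rate is the number worth pinning down.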

I am not minimizing the risk associated with non-ECC memory, rather looking for recent resources from which I can generate some reasonable prediction of the occurrence of ECC-related errors in an image processing pipeline.

I know njuffa has seen ECC errors occur with some frequency.

Anyone out there run into ECC errors with the GTX Titan X or any Maxwell GPU?

Is there a diagnostic method via software which can be used to either detect an ECC error or to determine the probability for a given set of memory that an ECC error may occur in a given timeframe?

Is GDDR5 RAM any more or less likely to be affected than DDR4 or DDR3 CPU memory?

njuffa · January 11, 2016, 6:50am

I am not sure what exactly I might have said previously about seeing ECC errors on Tesla GPUs. I have certainly seen memory errors in GPUs without ECC (detected as particular failure patterns in regression tests), across a largish and diverse collection of GPUs operating more or less continuously. I don’t have any formal statistics, but the rate of these failures was much lower than the “5/8 single-bit errors per hour per GB” quoted in the initial post in this thread.

Some of these errors were one-time transient events, which I chalked up to “cosmic rays” knocking out a bit. Others were reproducible, attributed to “memory going bad” (all electronic devices age physically, but I am not familiar with the specific failure mechanisms in DRAM).

One of the people who operate the Titan supercomputer at Oak Ridge gave a presentation on ECC errors observed on that system at the most recent GTC. He presented statistics for the entire machine which one should be able to convert, with some validity, to per-GPU statistics. As I recall, they found that errors are not distributed uniformly, and error rates are affected by adverse operating conditions. The slides for this talk can be found here: http://on-demand.gputechconf.com/gtc/2015/presentation/S5566-James-Rogers.pdf

ECC as implemented on Tesla GPUs is of the single-bit error correct, double-bit error detect variety, so the much more common single-bit events will not adversely affect operation of the GPU. The error will be fixed and recorded by incrementing a counter that can be read out with nvidia-smi. On GPUs without ECC, the impact of a single-bit error will very much depend on the avalanching behavior of the computation and the relevance of individual incorrect results to the validity of the final output. I could imagine that in a Monte-Carlo driven simulation such errors have no measurable impact, as each result is the combination of numerous randomly selected “paths”.
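A minimal sketch of reading that counter from a script, assuming nvidia-smi is on the PATH. The query field name `ecc.errors.corrected.volatile.total` is the one I would expect from the properties listed by `nvidia-smi --help-query-gpu`, so check it against your driver version. The helper names are hypothetical.

```python
import subprocess

# Assumed query field; confirm on your system with `nvidia-smi --help-query-gpu`.
ECC_FIELD = "ecc.errors.corrected.volatile.total"


def parse_ecc_counts(csv_text):
    """Turn nvidia-smi CSV output (one line per GPU) into a list of counts.

    GPUs without ECC support, or with ECC disabled, report a bracketed
    placeholder such as "[N/A]" instead of a number; those become None.
    """
    counts = []
    for line in csv_text.strip().splitlines():
        value = line.strip()
        counts.append(int(value) if value.isdigit() else None)
    return counts


def read_corrected_ecc_counts():
    """Run nvidia-smi and return the corrected-error count for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={ECC_FIELD}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_counts(result.stdout)
```

Calling `read_corrected_ecc_counts()` periodically and logging the values gives an observed error rate for a specific ECC-capable card. On a GTX card the result would be `None`, since there is no counter to read.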

I have not seen any paper comparing error rates between DDR3, DDR4, and GDDR5. Some years back, Google published the results of an extensive study on memory errors in their servers: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf

HannesF99 · January 11, 2016, 10:52am

See also the paper ‘Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation’ (Tiwari et al., http://www4.ncsu.edu/~dtiwari2/Papers/2015_HPCA_Tiwari_GPU_Reliability.pdf ). Note that it is aimed at large-scale systems, not at a single-GPU system.
