Computing the probability of ECC errors on a GTX GPU

Have to justify the use of the Tesla GPU line for a client, and for my use case the main compelling argument is avoiding ECC errors.

I have read through Google results such as this:

https://en.wikipedia.org/wiki/ECC_memory

and seen the counterargument (for the simulation case) made by Amber:

http://ambermd.org/gpus/#Accuracy

Assuming that a GTX Titan X will not be running 24/7, but rather in short batches 5 days a week, how likely are ECC errors to cause a serious issue?

Even if this statement is true:

about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate

the probability that the specific memory affected by the errors is used for a result is quite small.
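As a rough way to put numbers on this, the sketch below treats bit errors as a Poisson process at the quoted top-end rate and scales it by memory size, run time, and the fraction of memory that actually feeds the result. The 12 GB card size, the 2-hour batch, and the 10% "live" fraction are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

def expected_bit_errors(rate_per_gb_hour, memory_gb, hours, live_fraction=1.0):
    """Poisson mean: expected single-bit errors that land in live data."""
    return rate_per_gb_hour * memory_gb * hours * live_fraction

def prob_at_least_one(mean):
    """P(N >= 1) when the error count N is Poisson-distributed with this mean."""
    return 1.0 - math.exp(-mean)

# Top-end rate from the quoted statement: 5 errors per 8 GB per hour.
TOP_END_RATE = 5.0 / 8.0  # errors per GB per hour

# Assumed workload (illustrative only): 12 GB of GPU memory, a 2-hour batch,
# and 10% of memory holding data that actually reaches the output.
mean = expected_bit_errors(TOP_END_RATE, memory_gb=12, hours=2, live_fraction=0.1)

print(f"expected errors per batch: {mean:.2f}")
print(f"P(at least one error):     {prob_at_least_one(mean):.3f}")
```

Plugging in a lower measured rate, or a smaller live fraction, changes the answer by orders of magnitude, so the result is only as good as the rate that goes in.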

I am not minimizing the risk associated with memory that lacks ECC; rather, I am looking for recent resources from which I can generate a reasonable prediction of how often ECC-related errors would occur in an image processing pipeline.

I know njuffa has seen ECC errors occur with some frequency.

Anyone out there run into ECC errors with the GTX Titan X or any Maxwell GPU?

Is there a diagnostic method via software which can be used to either detect an ECC error or to determine the probability for a given set of memory that an ECC error may occur in a given timeframe?

Is GDDR5 RAM any more or less likely to be affected than DDR4 or DDR3 CPU memory?

I am not sure what exactly I might have said previously about seeing ECC errors on Tesla GPUs. I have certainly seen memory errors in GPUs without ECC (detected as particular failure patterns in regression tests), across a largish and diverse collection of GPUs operating more or less continuously. I don’t have any formal statistics, but the rate of these failures was much lower than the “5/8 single-bit errors per hour per GB” quoted in the initial post in this thread.

Some of these errors were one-time transient events, which I chalked up to “cosmic rays” knocking out a bit. Others were reproducible, attributed to “memory going bad” (all electronic devices age physically, but I am not familiar with the specific failure mechanisms in DRAM).
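To make a single knocked-out bit concrete, here is a minimal sketch that inverts one bit of a 32-bit float and shows how much the value changes depending on which bit is hit. The "pixel" value and the helper name `flip_bit` are hypothetical, chosen only for illustration.

```python
import struct

def flip_bit(value, bit):
    """Return `value` stored as float32 with bit `bit` (0 = least significant) inverted."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

pixel = 0.5  # hypothetical normalized pixel intensity

# Bit 0 is the lowest mantissa bit, bit 22 the highest mantissa bit,
# bit 30 the highest exponent bit, and bit 31 the sign bit.
for bit in (0, 22, 30, 31):
    print(f"bit {bit:2d}: {pixel} -> {flip_bit(pixel, bit)!r}")
```

A flip in a low mantissa bit is far below the noise floor of most image data, while a flip in a high exponent bit produces a wildly wrong value that is usually easy to spot.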

One of the people who operate the Titan supercomputer at Oak Ridge gave a presentation on ECC errors observed on that system at the most recent GTC. He presented statistics for the entire machine which one should be able to convert, with some validity, to per-GPU statistics. As I recall, they found that errors are not distributed uniformly, and error rates are affected by adverse operating conditions. The slides for this talk can be found here: http://on-demand.gputechconf.com/gtc/2015/presentation/S5566-James-Rogers.pdf.

ECC as implemented on Tesla GPUs is of the single-bit error correct, double-bit error detect variety, so the much more common single-bit events will not adversely affect operation of the GPU. The error will be fixed and recorded by incrementing a counter that can be read out with nvidia-smi. On GPUs without ECC, the impact of a single-bit error will very much depend on the avalanching behavior of the computation and the relevance of individual incorrect results to the validity of the final output. I could imagine that in a Monte-Carlo driven simulation such errors have no measurable impact, as each result is the combination of numerous randomly selected “paths”.
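For reading those counters from a script, a minimal sketch follows. It assumes an nvidia-smi build that supports `--query-gpu` with the `ecc.errors.*` fields (the supported field names can be listed with `nvidia-smi --help-query-gpu`). The sample text is made up for illustration and is not output captured from real hardware; the function names are hypothetical.

```python
import csv
import io
import subprocess

# Field names as accepted by `nvidia-smi --query-gpu=...`.
QUERY = ("index,name,"
         "ecc.errors.corrected.volatile.total,"
         "ecc.errors.uncorrected.volatile.total")

def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU.

    Counter fields that are not integers (for example "[N/A]" on a GPU
    without ECC) are returned as None.
    """
    def to_int(field):
        try:
            return int(field)
        except ValueError:
            return None

    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or unexpected lines
        index, name, corrected, uncorrected = (f.strip() for f in rec)
        rows.append({"index": int(index), "name": name,
                     "corrected": to_int(corrected),
                     "uncorrected": to_int(uncorrected)})
    return rows

def read_ecc_counters():
    """Run nvidia-smi and return parsed counters (needs an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return parse_ecc_csv(out)

# Made-up sample in the expected CSV shape, so the parser can be tried without a GPU.
sample = "0, Tesla K40m, 3, 0\n1, GeForce GTX TITAN X, [N/A], [N/A]\n"
for gpu in parse_ecc_csv(sample):
    print(gpu)
```

On a machine with the driver installed, `read_ecc_counters()` returns the same structure for the real GPUs. Logging it periodically gives a per-GPU error history, though only for boards that have ECC enabled.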

I have not seen any paper comparing error rates between DDR3, DDR4, and GDDR5. Some years back, Google published the results of an extensive study on memory errors in their servers: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf.

See also the paper ‘Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation’ (Tiwari et al., http://www4.ncsu.edu/~dtiwari2/Papers/2015_HPCA_Tiwari_GPU_Reliability.pdf ). Note that it is aimed at large-scale systems, not at a single-GPU system.