Scaling a single-bit error rate linearly

Hello;

[url]https://devtalk.nvidia.com/default/topic/909860/cuda-programming-and-performance/computing-the-probability-of-ecc-errors-on-a-gtx-gpu/post/4777916/#4777916[/url]

From that thread, I take it the single-bit error rate is 5 errors per 8 GB per hour.

Can I scale this result linearly to a card with 2 GB of memory, i.e. 5/4 errors per 2 GB per hour, if I assume the card models are the same but the memory sizes of the cards are different?

If this result is linearly scalable, could you also give me a reference for that?

The reason people use normalized error rates is that random single-bit errors caused by “cosmic rays” scale linearly with elapsed time and with memory size, as long as all memory involved is of the same type. Note that environmental factors may play into the error rate: it will increase with altitude, for example, as there is less shielding provided by the atmosphere.
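
As a quick sanity check (a minimal sketch, assuming the rate is strictly proportional to memory size and elapsed time, and the memory type is the same):

# purely illustrative scaling of the cited rate to a smaller card
cited_rate_per_gb_hour = 5.0 / 8.0              # 5 errors / 8 GB / hour
card_memory_gb = 2.0                            # a 2 GB card of the same memory type
print(cited_rate_per_gb_hour * card_memory_gb)  # 1.25 errors per hour, i.e. 5/4 per 2 GB per hour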

As I pointed out in the thread you referenced, the cited error rate (5 / 8 GB / hour) seems way too high to me. I have used GPUs with ECC, which offer SECDED (single error correct, double error detect) capability and count these events, and on no such card did the single-bit error count increment by a couple of errors an hour, or anywhere close to that. My personal guesstimate of the typical error rate would be something like one single-bit error per TB per day.

If you are doing an evaluation of error probability for safety reasons, I would urge you to consult the literature (e.g. refereed journals) or vendor data sheets for reliable data, rather than querying strangers on the internet.

njuffa;
Thank you for replying.
On second thought, I agree with you: these rates are way too high. But I have searched many journals on fault tolerance and on soft and hard errors, and none of the journals I looked at mentions an error rate. The only rate I have seen so far is the one in that thread. For research purposes I need a reasonable error rate for the Kepler architecture.

NVIDIA has some papers on this topic, but they do not give rates.
The link on the AMBER molecular dynamics simulation program’s web site, on which you also commented, does not include any rates either.

Oak Ridge National Laboratory also has a lot of research on GPU errors, with many interesting results, but their research does not include an error rate either.

Quote from the presentation:
Total Errors reported: 6,088,374
• Only 899 of 18,688 SXMs reported SBEs.
• 98% of the single bit errors were confined to 10 cards.

I can understand that not every card has the same error rate. Some cards, given their operating frequencies, handle computation with fewer errors. There are also environmental effects like magnetic and electric fields, thermal neutrons that cause fission reactions in the card’s materials, dust, electric current, etc.
I am at a loss right now because I don’t have any rates. That is why I am desperate to find any rates, scientific or not.

Other than susceptibility to environmental factors, manufacturing variations may be at play. The enormous concentration of single-bit errors in just ten GPUs reported in the presentation seems suspicious to me, and does not jibe with the “cosmic ray” model of DRAM errors (a DRAM cell hit by a ray discharges, losing its storage). Something else would seem to be at play. (*)

I have no idea what sort of statistical distribution bit error rates are supposed to have based on other underlying mechanisms of action. In practical terms, the only people with sufficient data are likely hyperscalers like Google and AWS, or operators of massive GPU-accelerated supercomputers. And if they are not revealing their data in detail, you are out of luck.

The data you quoted from the presentation would allow you to estimate an upper bound on the error rate, provided (1) it is stated over what span of time the data was collected and (2) the amount of memory on each SXM is known. Then max bit error rate = 6,088,374 single-bit errors * 0.98 * 0.1 / memory per SXM / hours of operation.
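
As a sketch in Python (the function and variable names are my own; the memory size and observation period still need to be supplied from the slides):

def max_sbe_rate(total_sbe, frac_on_worst, n_worst_cards, mem_per_sxm_gb, hours):
    # average single-bit errors / GB / hour across the worst cards
    return total_sbe * frac_on_worst / n_worst_cards / mem_per_sxm_gb / hours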

Looking at the slides, it seems the SXMs are K20s with 6 GB each, and the observational time was 22 months (2012 through 2014). If I punched in the numbers correctly, this gives a rate of about six single-bit errors / GB / hour for the ten worst GPUs. Excluding these ten worst GPUs, the error rate was only 6,088,374 single-bit errors * 0.02 / 18,678 / 6 GB / 16,104 hours ~= 6.7e-5 single-bit errors / GB / hour ~= 1.6 single-bit errors / TB / day, not far off my guesstimate.
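
For what it is worth, the same arithmetic in Python (the 24 hours/day and 1024 GB/TB conversion factors are mine):

worst10 = 6088374 * 0.98 / 10 / 6 / 16104     # ~6.2 single-bit errors / GB / hour
rest    = 6088374 * 0.02 / 18678 / 6 / 16104  # ~6.7e-5 single-bit errors / GB / hour
print(worst10, rest, rest * 24 * 1024)        # last value ~1.6 single-bit errors / TB / day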

What do you need the Kepler error rates for? Personally, I have a very conservative approach: if there is a possibility that an undetected bit error could lead to serious consequences, say an incorrect cancer diagnosis or a collapsing bridge, I want ECC in my hardware, everywhere. And I want the machine to halt on detection of a double-bit error.

(*) [Later:] This is actually covered in the presentation (slide 14; emphasis mine):
“Looking at the 10 SXMs that accounted for 98% of all SBEs, 99% of the errors on those 10 SXMs occurred on L2 cache. This is a clear indication of test escapes (5.35%) for the L2 cache. Removing the 10 worst cards, the distribution of the remainder of the errors is dominated by the device memory (GDDR5).”

I also have a GTX 670, which is a Kepler GK104 card. That is why I need a rate for Kepler cards, so I can apply it to my card.

Thank you for the peak error rates you produced from the data.
I will use these data to produce peak, minimum, and average error rates as you did.
But I should mention that the Oak Ridge accelerators are based on Kepler GK110, not GK104.
So I don’t know whether the Oak Ridge data would be useful for me.

Thank you for these formulas; I was about to start calculating them.

Yes; apparently those 10 cards have a worse L2 cache than the other cards. I am unsure whether that counts as failing.

I don’t know what you are looking for: the error rate for the DRAM used on GPUs, or the overall likelihood of single-bit errors anywhere in a GPU (so not just in the attached DRAM)?

I am not sure the ten faulty GPUs found on the supercomputer can be considered representative. Supercomputers often receive the initial batches of new GPUs. A recent example would be Summit, which absorbed the majority of the new Tesla V100 devices (27,648 in all). Generally speaking, early production runs of semiconductor devices often suffer from imperfect manufacturer test screens that are subsequently perfected based on returns from the field or longer-term studies by the vendor.

I would tend to think (without having any further evidence other than the extrapolated personal experience that led to my earlier guesstimate) that the DRAM single-bit error rate of the 18,678 non-defective K20s in the study from the slide deck is representative of what the general buying public would encounter with Kepler-based GPUs. The attached memory is GDDR5 for all of them (OK, some extremely low-end GPUs may use DDR3, but it seems those are of no interest to the OP).