A Motivation for ECC

seibert · October 6, 2009, 7:25pm

Google just published a new paper on their measurement of bit error rates in RAM in their clusters:

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

The conclusion is that the error rate is (converting units to something more familiar to me) between 150 and 400 bit errors per GB per month. That’s 100x larger than numbers I’ve heard in the past.

Anyway, another reason to look forward to Fermi…
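
For anyone who wants to redo the unit conversion in the post above, here is a minimal sketch. The input figure is the one quoted in the paper's abstract (25,000 to 70,000 errors per billion device hours per Mbit). The GB and month conventions, and the names `kMbitPerGB`, `kHoursPerMonth` and `errors_per_gb_month`, are assumptions made for this example and not something stated in the post.

```cuda
// Host-only arithmetic; builds with nvcc or any C++11 compiler.
// Assumed conventions (not from the post): 1 GB = 8192 Mbit, 1 month = 730 hours.
constexpr double kMbitPerGB     = 8192.0;
constexpr double kHoursPerMonth = 730.0;   // 8760 hours per year / 12

// Convert "errors per billion device hours per Mbit" into errors per GB per month.
constexpr double errors_per_gb_month(double errors_per_1e9_hours_per_mbit)
{
    return errors_per_1e9_hours_per_mbit * kMbitPerGB * kHoursPerMonth / 1e9;
}
```

Feeding in 25,000 and 70,000 gives roughly 150 and 419 errors per GB per month, which lines up with the 150 to 400 range quoted above.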

SPWorley · October 6, 2009, 8:02pm

Google did a similar (and even more interesting) analysis of hard drive failures a couple years back.

GPU memory may be very different than CPU DDR3 memory. Perhaps more, perhaps less, sensitive… who knows.
It would be fun to make a quick kernel that did a crude mem test that would run for a week, counting errors. This wouldn’t give you GPU quality stats really (with only one GPU) but it may give you cosmic ray measurements. It’d be interesting also to see if Denver has a lot more flips than a coastal city. (That extra 2km of dense atmosphere is a shield!)

You could simultaneously test registers and shared memory for the same kinds of flips over a long period of time. Fermi protects all types.

I do know someone made a RAM checker for the GPU based on memtest, but an even simpler kernel could do a long-term accumulator. Just zeroing out 500MB of RAM and repeatedly polling it for a single set bit would be a pretty valid cosmic ray tester.

(ha, it’d be fun for someone with a cesium gamma-ray source to move it towards the GPU to see if you can INDUCE radiation-based RAM flips…)

Demq · October 6, 2009, 8:56pm

Hmm, but did you read their conclusions?
"Conclusion 7: Error rates are unlikely to be dominated by soft errors." So with or without ECC, there will still be a bunch of hard errors present.

seibert · October 6, 2009, 9:04pm

If hard errors are correctable with a rewrite, then ECC can still help. I don’t believe hard errors have to necessarily mean bad cells. It just means the bit error is repeatable on consecutive reads.

seibert · October 6, 2009, 9:07pm

Semiconductor companies regularly bring chips to our high luminosity proton beam here at LANL to do tests like this for radiation-hardened products. :)

(And we do have a number of gamma sources downstairs, but I don’t get to play with them…)

Demq · October 6, 2009, 10:32pm

I would bet on something more ionizing, like a charged particle beam :)

SPWorley · October 7, 2009, 1:42am

OK, here’s a 10 minute hack to look for cosmic-ray RAM bit flips on your GPU.
It allocates and zeros a block of RAM on your GPU (say 750 MB) and then sleeps 10 minutes, and then checks the whole block to look for bit flips.
It runs this millisecond-long check only once every 10 minutes, so it won’t impact your CPU or GPU usage… you can just let this sit in a corner and cook for a month.
The more RAM you give it, the more it can check and the more likely it will be to find a flipped bit, but the less RAM will be available to any other GPU programs.

This is public domain.
For Windows, change the sleep function to Sleep() and include <windows.h>.

For real GPU memory quality checking, you might use GPUmemtest. This hack is different and just for long run simple detection of drifting bit values.
cosmic.cu (2.54 KB)
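
The attachment itself is not reproduced in this thread, so the following is only a sketch of the approach described in the post (zero a block, sleep, scan for set bits), not the contents of cosmic.cu. The names `count_set_bits` and `scan_for_flips`, the launch configuration, and the choice to re-zero the buffer after a hit are all assumptions of this example.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Grid-stride scan: every thread sums the set bits in the words it visits.
// The buffer was zeroed, so any set bit counts as a flip.
__global__ void count_set_bits(const unsigned int *buf, size_t nwords,
                               unsigned long long *flips)
{
    unsigned long long local = 0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nwords; i += stride)
        local += __popc(buf[i]);
    if (local) atomicAdd(flips, local);
}

// Hypothetical helper: zero `bytes` of device memory, then check it `rounds`
// times with `sleep_seconds` between checks.
// Returns the total number of flipped bits seen, or -1 on a CUDA error.
long long scan_for_flips(size_t bytes, int rounds, unsigned sleep_seconds)
{
    size_t nwords = bytes / sizeof(unsigned int);
    unsigned int *buf = nullptr;
    unsigned long long *d_flips = nullptr;
    if (cudaMalloc((void **)&buf, nwords * sizeof(unsigned int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_flips, sizeof(unsigned long long)) != cudaSuccess) {
        cudaFree(buf);
        return -1;
    }
    cudaMemset(buf, 0, nwords * sizeof(unsigned int));

    long long total = 0;
    for (int r = 0; r < rounds; ++r) {
        std::this_thread::sleep_for(std::chrono::seconds(sleep_seconds));
        cudaMemset(d_flips, 0, sizeof(unsigned long long));
        count_set_bits<<<256, 256>>>(buf, nwords, d_flips);
        unsigned long long h = 0;
        if (cudaMemcpy(&h, d_flips, sizeof h, cudaMemcpyDeviceToHost) != cudaSuccess) {
            total = -1;   // covers launch failures as well as copy failures
            break;
        }
        if (h) {
            std::printf("round %d: %llu flipped bit(s)\n", r, h);
            // Re-zero so the same flip is not counted again next round.
            cudaMemset(buf, 0, nwords * sizeof(unsigned int));
        }
        total += (long long)h;
    }
    cudaFree(buf);
    cudaFree(d_flips);
    return total;
}
```

Called as `scan_for_flips(size_t(750) << 20, 6 * 24 * 30, 600)`, this would follow the 750 MB block and 10 minute interval from the post for about a month. Using `std::this_thread::sleep_for` sidesteps the sleep()/Sleep() difference between platforms mentioned above.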

nitin.life · October 7, 2009, 5:10am

Thanks… looks great, will give it a try first thing tomorrow morning.

cbuchner1 · October 7, 2009, 8:06am

And in other news: A distributed cluster of nVidia GPUs successfully detects a gamma ray burst in the constellation of Cygnus using a tool written by SPWorley ;)

Christian

jma · October 7, 2009, 10:42pm

Hey, that’s not the way to go ghost-hunting in memory. If all bits are set to zero, then they will all help adjacent bits to stay zero as well (even if the real value of one of those bits might have been closer to 0.4999 and close to flipping …)

SPWorley · October 7, 2009, 11:08pm

That’s an interesting question, actually. Will a stray cosmic gamma ray tend to flip a 0 to a 1 by charging the potential cell? Would it tend to discharge a cell? Do neighboring bits “help” keep a bit stable?

From an armchair thought experiment, I’d expect that gamma rays would cause 0 bits to flip to 1 by penetrating the PC and hardware and finally colliding with the DRAM substrate itself, causing a shower of electrons to be ejected, and if that happens to be physically inside a DRAM cell, they can be trapped and your cell becomes 1. Of course maybe DRAM encodes a charged cell as 0 for all I know… I’m sure there’s outer space computer engineer dudes who write papers on this stuff all the time.

Seibert, you need to get one of those Cs-137 sources and do some experiments for us. I’m not even especially joking… it WOULD be interesting… in college I know we used quite safe low-intensity Cs sources sometimes, safe enough to let undergraduates use them anyway.

Plus it would be cool, you could plaster radiation hazard stickers all over your workstation. You’d be the envy of all the kids at the LAN parties. (Well, that and your 4xGTX 295 luggable Manifold machine.)

Steve
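
One way to probe the direction question raised in this post, and the all-zeros objection before it, would be to write a mixed pattern instead of zeros and tally flips by direction. This is only a sketch: `count_flips_by_direction` and `self_test` are hypothetical names, the 0x55555555 pattern is an arbitrary choice, and neighbouring bits in a word are not necessarily neighbouring cells on the DRAM die.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Compare each word with the pattern that was written and tally flips by direction.
__global__ void count_flips_by_direction(const unsigned int *buf, size_t nwords,
                                         unsigned int pattern,
                                         unsigned long long *zero_to_one,
                                         unsigned long long *one_to_zero)
{
    unsigned long long up = 0, down = 0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nwords; i += stride) {
        unsigned int diff = buf[i] ^ pattern;   // bits that changed
        up   += __popc(diff & ~pattern);        // written as 0, read back as 1
        down += __popc(diff & pattern);         // written as 1, read back as 0
    }
    if (up)   atomicAdd(zero_to_one, up);
    if (down) atomicAdd(one_to_zero, down);
}

// Hypothetical self-test (needs nwords >= 1): fill the buffer with 0x55555555,
// flip the bits of `inject_mask` in word 0 by hand, and report what the kernel
// counts. Returns false on a CUDA error.
bool self_test(size_t nwords, unsigned int inject_mask,
               unsigned long long *up, unsigned long long *down)
{
    const unsigned int pattern = 0x55555555u;
    unsigned int *buf = nullptr;
    unsigned long long *d_counts = nullptr;     // [0] = 0->1, [1] = 1->0
    if (cudaMalloc((void **)&buf, nwords * sizeof(unsigned int)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_counts, 2 * sizeof(unsigned long long)) != cudaSuccess) {
        cudaFree(buf);
        return false;
    }
    // cudaMemset works per byte; byte 0x55 repeated yields the word pattern.
    cudaMemset(buf, 0x55, nwords * sizeof(unsigned int));
    cudaMemset(d_counts, 0, 2 * sizeof(unsigned long long));

    unsigned int bad = pattern ^ inject_mask;   // simulated corruption in word 0
    cudaMemcpy(buf, &bad, sizeof bad, cudaMemcpyHostToDevice);

    count_flips_by_direction<<<64, 256>>>(buf, nwords, pattern,
                                          d_counts, d_counts + 1);
    unsigned long long h[2] = {0, 0};
    bool ok = cudaMemcpy(h, d_counts, sizeof h, cudaMemcpyDeviceToHost) == cudaSuccess;
    *up = h[0];
    *down = h[1];
    cudaFree(buf);
    cudaFree(d_counts);
    return ok;
}
```

In a long-running test the kernel would replace the plain set-bit count, and the two counters would show whether one direction dominates on a given card.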

jma · October 7, 2009, 11:27pm

The thing is that no bit is exactly zero or exactly one, but instead something in between … Each one is just closer to one value than to the other.

seibert · October 8, 2009, 3:08am

(Hah, this thread is strangely starting to wander into particle physics, my day job…)

One minor detail here: the majority of cosmic rays at the surface of the Earth are muons, which have the penetrating power to get through our atmosphere. (You do get some gammas from electromagnetic showers caused by collisions in the upper atmosphere.) As was mentioned, high-energy charged particles are quite good at ionizing matter as they pass through it. Gammas don’t really shower unless you get up to really high energies, and even then, they do so over long enough distances that it would be hard to get enough electrons displaced in the vicinity of a single bit. High purity germanium crystals (a lot like the silicon crystals grown for chip manufacturing) are used as gamma detectors, and they need to be centimeters in size to have a reasonable probability of capturing the gamma energy.

This DRAM-bit flipping scheme is getting close to the idea behind the silicon vertex tracker, which is found at the innermost layer of pretty much any particle accelerator experiment. Charged particles flying out of the collision region will leave a nice ionization trail behind them, and a finely segmented semiconductor device can easily pick out the displaced charge left in reverse-biased diode strips masked onto the silicon wafers.

We certainly have plenty of low intensity sources available (for testing germanium detectors), but I don’t think I’ll see much action from a gamma source. I’ll put it on my “slow Friday afternoon activity” list. :)

Heh, one should not casually put radiation hazard stickers on equipment around LANL. Management tends to not have a sense of humor about such things. :) (Not to mention that plenty of things here have that sticker for real! I’d hate for my fancy GPU workstation to get quarantined while someone went over every inch of it with a Geiger counter…)

seibert · October 8, 2009, 3:38am

Hah, not surprisingly, it looks like we were all wrong. Some googling suggests that the main cause of DRAM errors from cosmic rays is the secondary neutrons (usually produced by muons smacking into nuclei):

http://www.worldscibooks.com/etextbook/6661/6661_chap01.pdf

Muons and gammas can’t displace enough charge per unit distance to affect a cell, but a neutron-induced nuclear recoil can. (Or an alpha decay, to a lesser extent.)

SPWorley · October 8, 2009, 4:11am

Actually that is indeed pretty interesting… you never know what you’re going to learn in the NV CUDA forum!

So gotcha, we shouldn’t mount our GPUs to the walls of our fast reactor chambers, or we’ll see artifacts when we’re playing Quake. I’ve been meaning to move my reactor down into the basement anyway.

Sarnath · October 8, 2009, 5:27am

So, better to cadmium coat our computer cases…
