Error Checking and Correction

Hi,

I was wondering if there is any error checking or error correction in GPUs, to ensure consistent results? I would guess that this isn’t an important feature for graphics, so I’m wondering if it has been omitted.

I am pretty sure they don’t do any ECC on their GDDR RAM on the consumer boards because of the extra costs involved.

For certain kinds of scientific calculations one might perform some manual consistency checking, for example checking for conservation of momentum and energy in N-body simulations - but that adds extra overhead.
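For example, a minimal sketch of such a check for a direct-sum N-body code (everything here is hypothetical - the names, the G = 1 units, and the tolerance are placeholders, not any particular library’s API):

```cpp
#include <cmath>

struct Body { double x, y, z, vx, vy, vz, m; };

// Total energy = kinetic + pairwise gravitational potential (G = 1 units).
double total_energy(const Body* b, int n) {
    double e = 0.0;
    for (int i = 0; i < n; ++i) {
        e += 0.5 * b[i].m * (b[i].vx * b[i].vx + b[i].vy * b[i].vy + b[i].vz * b[i].vz);
        for (int j = i + 1; j < n; ++j) {
            double dx = b[i].x - b[j].x, dy = b[i].y - b[j].y, dz = b[i].z - b[j].z;
            e -= b[i].m * b[j].m / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return e;
}

// Total momentum; each component should stay (nearly) constant over the run.
void total_momentum(const Body* b, int n, double p[3]) {
    p[0] = p[1] = p[2] = 0.0;
    for (int i = 0; i < n; ++i) {
        p[0] += b[i].m * b[i].vx;
        p[1] += b[i].m * b[i].vy;
        p[2] += b[i].m * b[i].vz;
    }
}

// Flag a run whose relative energy drift exceeds what the integrator
// itself should produce - a crude but cheap sanity check.
bool energy_ok(double e_initial, double e_now, double tol) {
    return std::fabs((e_now - e_initial) / e_initial) < tol;
}
```

Periodically copying the state back to the host and running a check like this costs some bandwidth, but it catches gross corruption without having to rerun anything.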

Thanks for the reply, cbuchner1.

When you say consumer products, would this include the Tesla range?

I have no in-depth knowledge about the difference between Tesla and the consumer graphics boards. All I know is that they usually have more memory on board. Whether there are any other subtle differences besides the missing DVI/VGA ports, I cannot say.

I don’t know the details of what the hardware does or does not do for error checking, but I can offer anecdotal evidence. My app is an n-body type simulation usually run for millions of steps. The solutions are chaotic, meaning that even a 1-bit change in the lowest-precision bit of a floating point value will lead to drastically different (though statistically similar) results after only a few thousand steps. I have never noticed such a difference on the GPU. Every single repeat run from the same initial conditions I’ve checked is bit-for-bit equal to the others. I’ve seen this behavior on both Tesla and the 8800 GTX.
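(For what it’s worth, that comparison is trivial to automate - a sketch, where state0 and state1 are hypothetical host copies of the final state from two repeat runs:)

```cpp
#include <cstddef>
#include <cstring>

// Exact comparison of two runs' results: true only if every byte (and
// therefore every bit) of the two state buffers is identical.
bool runs_match(const float* state0, const float* state1, std::size_t n_floats) {
    return std::memcmp(state0, state1, n_floats * sizeof(float)) == 0;
}
```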

Overclocked GPUs are another story. I usually get crashes or NaN results within a few thousand steps, which I presume is due to occasional bit errors that wouldn’t affect gaming, as you say. But in an app where the kernel loops over a value read from memory, a 1-bit error can turn that loop into an infinite loop and an apparent crash.
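To illustrate (a hypothetical kernel, not my actual code): if the trip count comes straight from global memory, clamping it to a known bound turns a flipped bit into one wrong element instead of a hang:

```cpp
// Each thread sums counts[i] values from its row of data (data holds
// n * max_count floats). Without the clamp, a bit error in counts[i] -
// say 8 flipping to 0x80000008 - makes the loop run effectively forever;
// with it, the damage is limited to out[i].
__global__ void accumulate(const int* counts, const float* data,
                           float* out, int n, int max_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int c = counts[i];
    if (c < 0 || c > max_count) c = max_count;  // guard against bit errors
    float sum = 0.0f;
    for (int k = 0; k < c; ++k)
        sum += data[i * max_count + k];
    out[i] = sum;
}
```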

At stock clocks, the GPUs are very stable. I’ve had simulations running 24/7 for several weeks now without any problems.

I can give anecdotal confirmation of that as well… I have a simulation cooking which is also bit-sensitive, and it has churned 24/7 for a month without a squeak on a GTX 280, even with an underpowered PSU. The computations are double-checked: the GPU identifies candidates, and the CPU verifies that the candidates actually pass the criterion the GPU said they did.
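The verification side is cheap because only flagged candidates get re-checked - roughly this shape (the passes criterion below is a stand-in, not my real test):

```cpp
// CPU re-evaluation of GPU-flagged candidates: a rare bit error on the GPU
// costs at worst a wasted check, never a false positive in the final output.
static bool passes(double x) { return x > 0.999; }  // stand-in criterion

int verify_candidates(const double* values, const int* flagged, int n_flagged) {
    int confirmed = 0;
    for (int k = 0; k < n_flagged; ++k)
        if (passes(values[flagged[k]]))  // CPU re-runs the GPU's claim
            ++confirmed;
    return confirmed;
}
```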

I also ran some simple tests between the GTX 280 and my laptop’s GPU; those were done manually one afternoon, but they were bit-for-bit matching too.

So it’s not a guarantee, but GPU results do seem reliably predictable.

The story on differences between Tesla and GeForce in terms of memory:

No, we do not have ECC at the moment. However, we do take a number of steps to ensure zero bit errors on Tesla compared to GeForce. First, we use memory that’s tested to much higher specifications on Tesla (I believe this means much longer at higher temps AND higher frequencies). If there are any problems whatsoever, it gets thrown out. Second, we lower the clock on our memory on Teslas compared to what the memory is actually specced for in order to guarantee that there are no bit errors (so not only did we not see any when testing at a higher frequency, we also underclock the memory).

So, Tesla is more reliable than GeForce in terms of memory. That’s not to say that GeForce is bad, just that it’s not tested as thoroughly (if you have a bit error on your average GeForce, who cares? It’s a pixel that appears for 1/30 of a second!).

Finally, since Tesla has been introduced, we’ve never seen a bit error, either in our internal testing or reported to us by customers. We haven’t seen a bit error on GeForce, either.

Okay, but that random cosmic ray streak is just waiting to hit your DRAM cell and flip a bit. So I guess until that ECC feature is available, one should not control an automated brain surgery robot with CUDA.

;-)

No, you just have two different cards perform the same calculation, and if they don’t agree you do it again. Then you start cutting into the brain.
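Something like this, say (a sketch; compute stands in for a hypothetical wrapper around cudaSetDevice, the kernel launch, and the copy back to the host):

```cpp
#include <cstddef>
#include <cstring>

// Repeat the calculation on two devices until their raw outputs agree
// bit for bit; a transient error on either card just triggers a rerun.
template <typename Compute>
void run_until_agreement(Compute compute, float* out0, float* out1,
                         std::size_t n_floats) {
    do {
        compute(/*device=*/0, out0);
        compute(/*device=*/1, out1);
    } while (std::memcmp(out0, out1, n_floats * sizeof(float)) != 0);
}
```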

But what if there are TWO cosmic rays and they happen to flip the same bit on two different cards…?

Good catch, I wasn’t thinking about redundancy yet. You could also run the same calculation twice on one piece of hardware - but that would allow a reproducible hardware malfunction to trigger the same error twice. So the idea of two cards which have to agree is nice.

You could also use 3 cards and take a majority vote - then the brain surgery proceeds even in the case of a single fault.
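Bitwise, the 2-of-3 vote is a one-liner (sketch):

```cpp
#include <cstdint>

// Per-bit majority across three results: each output bit is whatever at
// least two of the three inputs agree on, so one faulty card is outvoted.
uint32_t majority3(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}
```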

A giant meteor could just as well hit Earth and wipe out all of civilization. That kind of messes up the surgery as well. What was the original topic again? lol