We have just noticed that one of our Tesla C1060 cards sitting in a GPU node in a supercomputer room started providing wrong results. All the other cards in the same node are running perfectly fine. The error is of the order of 0.001-0.0001%, i.e. quite small, but it’s large enough to make these results inapplicable to, for example, quantum chemistry. The sad thing is that we have to trash all data obtained on this node since we don’t know when it started to happen. This node has been quite heavily used for about 1-2 years.
Does anybody know figures like the mean time between failures, or the typical replacement interval, for Tesla GPUs (C1060, C2050, etc.)? I'd very much appreciate any information.
Just a suggestion: maybe the Nvidia people should create a quick test suite for Tesla GPUs, to be run periodically on cluster nodes to check hardware sanity. I understand that everyone can write one for their own use (on the other hand, I'm pretty sure 95% of system administrators won't bother), but a more standard tool might be more appropriate, especially since Tesla is a professional-grade product.
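The core of such a sanity check is simple: run a deterministic computation repeatedly and compare the result bit-for-bit against a golden value recorded on known-good hardware. Here is a minimal sketch of that idea; `run_gpu_benchmark` is a hypothetical placeholder for launching a real fixed-input CUDA kernel, and the golden value is just illustrative:

```python
# Hypothetical golden value, recorded once on known-good hardware.
GOLDEN = -3754.4584818935


def run_gpu_benchmark():
    """Placeholder for a deterministic GPU computation.

    On a real node this would launch a fixed-input CUDA kernel
    (e.g. a reduction over a constant array) and return the result.
    Here it simply returns the golden value so the sketch runs.
    """
    return -3754.4584818935


def check_gpu(n_runs=5):
    """Run the benchmark n_runs times and compare bit-for-bit.

    A healthy card running a deterministic kernel must reproduce
    the golden value exactly every time; any deviation flags the
    card as suspect.
    """
    for i in range(n_runs):
        result = run_gpu_benchmark()
        if result != GOLDEN:  # exact comparison: determinism is the point
            return False, i, result
    return True, n_runs, GOLDEN


if __name__ == "__main__":
    ok, runs, value = check_gpu()
    print("PASS" if ok else "FAIL", runs, value)
```

The exact (rather than tolerance-based) comparison is deliberate: for a fixed kernel, fixed inputs, and fixed hardware, floating-point results should be reproducible to the last bit, so even the tiny 0.001% errors described above would be caught.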
To give you a flavor of what's going on, I ran the same program 5 times on the faulty card; here are the results it produced:
For comparison, the correct value is -3754.4584818935. All other GPUs in the node produce exactly the same result, to all 14 digits.