We have just noticed that one of our Tesla C1060 cards sitting in a GPU node in a supercomputer room started providing wrong results. All the other cards in the same node are running perfectly fine. The error is of the order of 0.001-0.0001%, i.e. quite small, but it’s large enough to make these results inapplicable to, for example, quantum chemistry. The sad thing is that we have to trash all data obtained on this node since we don’t know when it started to happen. This node has been quite heavily used for about 1-2 years.
Does anybody know figures like the mean time between failures, or the typical replacement interval, for Tesla GPUs (C1060, C2050, etc.)? I'd very much appreciate any information.
Just a suggestion: maybe the Nvidia people should create a quick test suite for Tesla GPUs, to be run periodically on cluster nodes to check hardware sanity. I understand that everyone can write one for their own use (on the other hand, I'm pretty sure 95% of system administrators won't bother), but a more standard tool might be more appropriate, especially since Tesla is a professional-grade product.
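The core of such a sanity check is simple: run a deterministic computation repeatedly and compare the result bit-for-bit against a golden value recorded on known-good hardware. Here is a minimal sketch of that idea; `run_gpu_benchmark` is a hypothetical placeholder for launching a real fixed-input CUDA kernel, and the golden value is just illustrative:

```python
# Hypothetical golden value, recorded once on known-good hardware.
GOLDEN = -3754.4584818935


def run_gpu_benchmark():
    """Placeholder for a deterministic GPU computation.

    On a real node this would launch a fixed-input CUDA kernel
    (e.g. a reduction over a constant array) and return the result.
    Here it simply returns the golden value so the sketch runs.
    """
    return -3754.4584818935


def check_gpu(n_runs=5):
    """Run the benchmark n_runs times and compare bit-for-bit.

    A healthy card running a deterministic kernel must reproduce
    the golden value exactly every time; any deviation flags the
    card as suspect.
    """
    for i in range(n_runs):
        result = run_gpu_benchmark()
        if result != GOLDEN:  # exact comparison: determinism is the point
            return False, i, result
    return True, n_runs, GOLDEN


if __name__ == "__main__":
    ok, runs, value = check_gpu()
    print("PASS" if ok else "FAIL", runs, value)
```

The exact (rather than tolerance-based) comparison is deliberate: for a fixed kernel, fixed inputs, and fixed hardware, floating-point results should be reproducible to the last bit, so even the tiny 0.001% errors described above would be caught.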
To give you a flavor of what's going on, I ran the same program 5 times on the faulty card; here are the results it produced:
For comparison, the correct value is -3754.4584818935. All other GPUs in the node produce exactly the same result, to all 14 digits.