Failure rates of GeForce vs Tesla

In my company we have a new product line we are developing which will contain a CUDA element, and currently we are trying to spec up what exactly will be needed from a hardware point of view. I am trying to argue the benefits of going for a Tesla solution (K20 most likely) over a standard GeForce Titan option. With the addition of the Tesla doubling the cost of the PC component it requires some significant justification as to why we should spend an extra £2000 for no performance gain. My current arguing points are:

  • Increased Nvidia support
  • Potential Performance gain from bypassing WDDM
  • Increased Reliability

From my manager's point of view the first point is minimal, and the second may be worth a couple of hundred quid if it provides a significant performance boost, but it is the last one that I think the decision hinges on. Unfortunately, with only hand-wavey arguments that the Titans are unlikely to survive being run 24/7 for months at a time, and no concrete numbers, I am fighting a losing battle.

Does anyone have any figures on failure rates of Teslas vs GeForce cards? MTBF figures for transient (bit flip, card reboot) and catastrophic (dead card) failures would be awesome.

tl;dr
Need to justify extra cost of Tesla over Titan. Failure rates/Calculation or Memory error rates would be incredibly helpful.

The Tesla K20 document at http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Active-BD-06499-001-v04.pdf gives an MTBF number (page 7, Table 1. Board Configuration).

Thanks, it is a start. I cannot find any information on GeForce MTBFs, or on whether the quoted figures refer to transient or catastrophic failures.

The main problem with getting MTBF values for GeForce cards is the really wide range of cooling solutions and clock rates. All the manufacturers struggling to distinguish their gamer cards with 7 different models make it pretty hard to get useful statistics on any particular configuration. I also doubt their manufacturing processes are consistent enough to make 3rd party estimates reliable over the product lifetime.

My anecdotal experience (about 15 different GeForce cards spread over all the generations of CUDA devices, with only 2 deaths) has been that the failure rates of standard-clocked GeForce cards are not high enough, on their own, to justify the price difference between Tesla and GeForce. Assuming a useful lifetime of 4 years for a Tesla GPU, you could afford to burn out a Titan card every 2 years and still come out ahead financially.
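To put rough numbers on that break-even argument, here is a minimal sketch. The £2000 premium comes from the original question; the Titan price of £800 is purely a placeholder assumption, so substitute your own quotes.

```python
import math

# ASSUMPTION: placeholder Titan price, not a quoted figure.
TITAN_PRICE_GBP = 800.0
# From the original question: the Tesla costs about 2000 GBP more.
TESLA_PRICE_GBP = TITAN_PRICE_GBP + 2000.0


def spend_over_period(card_price, period_years, replace_every_years):
    """Total hardware spend if a card is replaced every `replace_every_years`."""
    cards_needed = math.ceil(period_years / replace_every_years)
    return cards_needed * card_price


# One Tesla lasting 4 years vs. a Titan that dies every 2 years.
tesla_spend = spend_over_period(TESLA_PRICE_GBP, 4, 4)
titan_spend = spend_over_period(TITAN_PRICE_GBP, 4, 2)
print(f"Tesla: {tesla_spend:.0f} GBP, Titan: {titan_spend:.0f} GBP")
```

This only counts card purchases. Downtime, swap labour, and any results lost when a card dies are not included and could shift the balance.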

The real value-add for Tesla is the non-WDDM driver (if you have to use Windows) and the ECC device memory if the risk (probability * cost) associated with a memory error is high for your application.

For generic memory error rates (not GPU specific), this paper seems to be one of the more recent examples:

http://softerrors.info/selse/images/selse_2012/Papers/selse2012_submission_4.pdf

They see approximately 1 correctable memory error (a single bit flip that can be fixed by ECC) per gigabyte of memory per month. So each Titan is probably going to see a bit flip somewhere in its 6 GB of memory (possibly somewhere you aren’t using) every 5 days on average. Whether that is a compelling reason to go for Tesla depends on the length of your calculations, how much memory you use, how many GPUs are involved, and the cost to your organization if a bit flip occurs during a calculation.
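The arithmetic behind that estimate can be sketched as follows. The rate of 1 error per GB per month is the figure quoted above; the 30-day month and the assumption that flips arrive independently at a constant rate (a Poisson process) are simplifications of mine, so treat the output as an order-of-magnitude guide only.

```python
import math

ERRORS_PER_GB_PER_MONTH = 1.0  # figure quoted from the paper above
DAYS_PER_MONTH = 30.0          # ASSUMPTION: round number for a month


def mean_days_between_flips(memory_gb, rate=ERRORS_PER_GB_PER_MONTH):
    """Average number of days between bit flips on one card."""
    return DAYS_PER_MONTH / (memory_gb * rate)


def expected_flips(memory_gb, run_days, n_gpus=1, rate=ERRORS_PER_GB_PER_MONTH):
    """Expected number of bit flips across all cards during one run."""
    return memory_gb * rate * n_gpus * run_days / DAYS_PER_MONTH


def prob_at_least_one_flip(memory_gb, run_days, n_gpus=1):
    """Chance of one or more flips in a run, ASSUMING Poisson arrivals."""
    return 1.0 - math.exp(-expected_flips(memory_gb, run_days, n_gpus))


print(mean_days_between_flips(6))  # 5.0 days for a 6 GB Titan
print(prob_at_least_one_flip(6, run_days=1))
```

If your kernels only touch part of the 6 GB, pass the memory you actually use as `memory_gb`, since a flip in unused memory does no harm.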

The bit flips are not evenly distributed. In a server environment, about 10% of DIMMs will see any kind of error (correctable or not) over their lifetimes; however, those 10% will have A LOT of them.

I saw the stats for GeForce somewhere. Soft errors there are MUCH, MUCH more likely (perhaps because it’s GDDR5?). Based on that, I wouldn’t recommend using non-ECC GPUs unless

  1. the errors don't matter (cryptocurrency, etc.), or
  2. you can detect errors reliably in software (e.g. molecular dynamics people can test the total energy), or
  3. you use 2 GPUs and compare the results
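As a rough illustration of options 2 and 3, the host-side checks might look something like this. The function names and tolerances are hypothetical; the results are assumed to have already been copied back from the GPU(s) into plain Python sequences.

```python
import math


def results_agree(run_a, run_b, rel_tol=1e-6, abs_tol=1e-9):
    """Option 3: compare the outputs of two redundant runs element by element.

    NaNs never compare as close, so a NaN in either run counts as a mismatch.
    """
    if len(run_a) != len(run_b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(run_a, run_b)
    )


def invariant_holds(value, reference, rel_tol=1e-4):
    """Option 2: check a conserved quantity (e.g. total energy) against a reference."""
    return math.isfinite(value) and math.isclose(value, reference, rel_tol=rel_tol)


print(results_agree([1.0, 2.0], [1.0, 2.0]))  # True
print(results_agree([1.0, 2.0], [1.0, 4.0]))  # False
```

If both GPUs run the same deterministic kernels, you can tighten the tolerances to zero and demand bit-identical output. The loose defaults here assume something like atomics-based reductions, where summation order varies between runs.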

Another subtle issue to consider is that Teslas can transfer data to and from the device concurrently, effectively doubling the maximum host-device bandwidth compared to non-Teslas.

Thanks for the responses, they pretty much match my gut feeling. Hardware failures alone are not enough to justify the significant premium.

This then leaves the WDDM speed-boost. My gut feeling is that bypassing WDDM may give up to a 20% speed-up, since we use Windows 7. Again, that is probably not sufficient to justify the cost, since you can simply buy two GeForce cards and parallelize the problem across them.

ECC is probably the wild card for me. Currently a bit flip is a minor annoyance, with the odd reconstruction failing (NaNs everywhere), but that can be detected after the fact and the pass simply redone. That said, in future we may be heading into markets where accuracy is more important. I think the likely outcome is staying with the Titans until someone throws a lot of money at us.
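For what it's worth, the detect-and-redo approach described above amounts to something like the sketch below. `reconstruct` is a hypothetical stand-in for whatever launches the GPU pass and returns the result as a sequence of floats; the retry limit is an arbitrary choice.

```python
import math


def has_nan(values):
    """True if any element of the result is NaN."""
    return any(math.isnan(v) for v in values)


def run_with_retry(reconstruct, max_attempts=3):
    """Re-run a pass until its output contains no NaNs.

    `reconstruct` is a HYPOTHETICAL zero-argument callable standing in for
    the real GPU pass. Returns (result, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = reconstruct()
        if not has_nan(result):
            return result, attempt
    raise RuntimeError(f"still producing NaNs after {max_attempts} attempts")
```

The catch is that this only catches flips that blow up into NaNs. A flip that leaves a plausible but wrong value passes straight through, which is exactly the case that would matter in an accuracy-sensitive market.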