Failure rates of GeForce vs Tesla

Tiomat · August 7, 2014, 10:03am

My company is developing a new product line which will contain a CUDA element, and we are currently trying to spec up exactly what will be needed from a hardware point of view. I am trying to argue the benefits of going for a Tesla solution (K20 most likely) over a standard GeForce Titan option. Since the Tesla would double the cost of the PC component, it requires some significant justification as to why we should spend an extra £2000 for no performance gain. My current arguing points are:

Increased Nvidia support
Potential Performance gain from bypassing WDDM
Increased Reliability

From my manager's point of view the first point is minimal, and the second may be worth a couple of hundred quid if it provides a significant performance boost, but it is the last one that I think the decision hinges on. Unfortunately, I only have hand-wavy arguments that the Titans are unlikely to survive being run 24/7 for months at a time, and without concrete numbers I am fighting a losing battle.

Does anyone have any figures on failure rates of Teslas vs GeForce cards? MTBF figures for transient (bit flip, card reboot) and catastrophic (dead card) failures would be awesome.

tl;dr
Need to justify extra cost of Tesla over Titan. Failure rates/Calculation or Memory error rates would be incredibly helpful.

HannesF99 · August 7, 2014, 3:25pm

The Tesla K20 document gives an MTBF number (page 7, Table 1, Board Configuration).

Tiomat · August 7, 2014, 3:30pm

Thanks, it is a start. I cannot find any information on GeForce MTBFs, or whether they are transient or catastrophic failures.

seibert · August 11, 2014, 2:31am

The main problem with getting MTBF values for GeForce cards is the really wide range of cooling solutions and clock rates. All the manufacturers struggling to distinguish their gamer cards with 7 different models make it pretty hard to get useful statistics on any particular configuration. I also doubt their manufacturing processes are consistent enough to make 3rd party estimates reliable over the product lifetime.

My anecdotal experience (about 15 different GeForce cards spread over all the generations of CUDA devices, with only 2 deaths) has been that the failure rates of standard-clocked GeForce cards are not high enough to justify alone the price difference between Tesla and GeForce. Assuming a useful lifetime of 4 years for a Tesla GPU, you could afford to burn out a Titan card every 2 years and still come out ahead financially.
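The break-even argument above can be sketched as a few lines of arithmetic. The £2000 premium comes from the original question; the Titan price used here is a hypothetical placeholder, not a quoted figure, and the function name is made up for illustration.

```python
import math

def geforce_cost_over_lifetime(card_price, lifetime_years, years_per_card):
    """Total spend on GeForce cards if one card burns out every
    `years_per_card` years over a `lifetime_years` service period."""
    cards_needed = math.ceil(lifetime_years / years_per_card)
    return cards_needed * card_price

# Hypothetical Titan price in GBP (assumption, not from the thread).
titan_price = 800.0
# The thread states the Tesla costs roughly an extra 2000 GBP.
tesla_price = titan_price + 2000.0

# One Tesla for 4 years vs. a Titan replaced every 2 years.
geforce_total = geforce_cost_over_lifetime(titan_price, 4, 2)
print(geforce_total, tesla_price, geforce_total < tesla_price)
```

With these assumed prices the two Titans still cost less than the single Tesla; the conclusion flips only if the replacement interval gets much shorter or downtime has a significant cost of its own.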

The real value-add for Tesla is the non-WDDM driver (if you have to use Windows) and the ECC device memory if the risk (probability * cost) associated with a memory error is high for your application.

For generic memory error rates (not GPU specific), this paper seems to be one of the more recent examples:

http://softerrors.info/selse/images/selse_2012/Papers/selse2012_submission_4.pdf

They see approximately 1 correctable memory error (a single bit flip that can be fixed by ECC) per gigabyte of memory per month. So each Titan is probably going to see a bit flip somewhere in its 6 GB of memory (possibly somewhere you aren’t using) every 5 days on average. Whether that is a compelling reason to go for Tesla depends on the length of your calculations, how much memory you use, how many GPUs are involved, and the cost to your organization if a bit flip occurs during a calculation.
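As a rough sketch of that estimate, the rate can be scaled to a given memory size and run length. This assumes errors arrive independently and uniformly (a Poisson model), which is a simplification and may not match real hardware; the function names and the 30-day month are choices made here for illustration.

```python
import math

def mean_days_between_flips(memory_gb, errors_per_gb_month=1.0, days_per_month=30):
    """Average number of days between bit flips for a card with `memory_gb` of memory."""
    return days_per_month / (errors_per_gb_month * memory_gb)

def expected_bit_flips(memory_gb, hours, errors_per_gb_month=1.0, hours_per_month=30 * 24):
    """Expected number of bit flips during a run lasting `hours`."""
    return errors_per_gb_month * memory_gb * hours / hours_per_month

def prob_at_least_one_flip(memory_gb, hours, errors_per_gb_month=1.0):
    """Probability of one or more flips during the run, under the Poisson assumption."""
    lam = expected_bit_flips(memory_gb, hours, errors_per_gb_month)
    return 1.0 - math.exp(-lam)

# A 6 GB Titan at 1 error per GB per month:
print(mean_days_between_flips(6))        # 5 days, matching the figure above
print(prob_at_least_one_flip(6, 24))     # chance of a flip during a 24-hour run
```

Multiplying `memory_gb` by the number of GPUs gives the same estimate for a multi-card machine.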

alexgg · August 11, 2014, 3:19am

The bit flips are not evenly distributed. In a server environment, about 10% of DIMMs will see any kind of error (correctable or not) over their lifetimes; however, those 10% will have A LOT of them.

I saw the stats for GeForce somewhere. The soft errors there are MUCH, MUCH more likely (perhaps because it's GDDR5?). Based on that, I wouldn't recommend using non-ECC GPUs unless

the errors don't matter (cryptocurrency, etc.), or
you can detect errors reliably in software (e.g. molecular dynamics people can test the total energy), or
you use 2 GPUs and compare the results
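The last two options can be sketched in host-side code. This is a minimal illustration, not anyone's production check: the function names and tolerances are hypothetical, and the result lists stand in for data copied back from each GPU.

```python
import math

def results_agree(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Compare two result arrays from redundant runs (e.g. one per GPU).
    NaN never compares close to anything, so a NaN in either run counts as a mismatch."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))

def energy_conserved(e_start, e_end, rel_tol=1e-6):
    """Invariant check in the style of a molecular dynamics total-energy test."""
    return math.isclose(e_start, e_end, rel_tol=rel_tol)

gpu0 = [1.0, 2.0, 3.0]
gpu1 = [1.0, 2.0, 3.0]
print(results_agree(gpu0, gpu1))          # runs match, accept the result
print(energy_conserved(100.0, 101.0))     # drift too large, rerun the step
```

Bitwise-identical results across two GPUs are only a fair expectation when both run the same kernel with the same launch configuration; otherwise a small relative tolerance is the safer comparison.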

alexgg · August 11, 2014, 3:35am

Another subtle issue to consider is that Teslas can transfer data to and from the device concurrently, effectively doubling the maximum host-device bandwidth compared to non-Teslas.

Tiomat · August 11, 2014, 8:27am

Thanks for the responses, it pretty much matches what my gut feeling was. Hardware failures alone are not enough to justify the significant premium.

This then leaves the speed-up from bypassing WDDM. My gut feeling here is that it may give up to a 20% speed-up since we use Windows 7, which again is probably not sufficient to justify the cost, since you can simply get two GeForce cards and parallelize the problem over them.

ECC is probably the wild card for me. Currently a bit flip is a minor annoyance, with the odd reconstruction failing (NaNs everywhere), but that can be detected after the fact and the pass simply redone. That said, in future we may be heading into markets where accuracy is more important. I think the likely outcome is staying with the Titans until someone throws a lot of money at us.
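The detect-and-redo approach described above might look roughly like this on the host side. It is only a sketch: `compute_pass` is a hypothetical callable standing in for one reconstruction pass that returns a flat list of floats, and the retry limit is an arbitrary choice.

```python
import math

def run_with_retry(compute_pass, max_attempts=3):
    """Run `compute_pass` until its output contains no NaNs.
    Returns (result, attempts_used); raises if every attempt is corrupted."""
    for attempt in range(1, max_attempts + 1):
        result = compute_pass()
        if not any(math.isnan(v) for v in result):
            return result, attempt
    raise RuntimeError(f"output still contained NaNs after {max_attempts} attempts")
```

This only catches corruption that propagates to a NaN. A flipped bit that leaves a plausible finite value passes straight through, which is the case where ECC would matter.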
