MTBF information?

A question to the people from NVIDIA or the people running clusters with CUDA (like Tachyon John)

Are there any MTBF numbers available for NVIDIA GPUs or Teslas? Or do people have experience with large clusters and large numbers of GPUs?
I am nearing the end of my evaluation of CUDA and will be proposing that we use CUDA in our products, but the first thing our Processing department will ask is: what is the MTBF?

Personally, I have only had bad experiences with Dell motherboards/Intel CPUs; my two 8800 GTXs have been running happily, but that doesn’t tell me a lot statistically ;)

OK, what is MTBF?

Wikipedia is your friend :)

http://en.wikipedia.org/wiki/MTBF
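In a nutshell (and only as a rough summary of that page), it is the total accumulated operating time divided by the number of failures observed:

\[
\mathrm{MTBF} \;=\; \frac{\text{total accumulated operating time}}{\text{number of observed failures}}
\]

So, with made-up numbers purely for illustration, 100 GPUs each running 1000 hours with two failures among them would give an MTBF of 100 × 1000 / 2 = 50,000 hours.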

We’ve had very good luck so far. At the moment, I’d have to say that software (ours, the Linux kernel, the GPU driver, etc.) has been the dominant factor in stability; so far we haven’t had any hardware issues that I’m aware of. My personal opinion is that software stability issues are likely to be dominant for most CUDA codes at the outset. Hardware issues will probably matter somewhat more for people doing long runs on larger GPU counts. I’ll have a lot more experience, and likely more to say about this, in about 8 weeks or so, after we’ve been running production jobs for some weeks on largish (32-64) GPU counts.

Cheers,

John

One more comment on my previous note: I should also mention that even after we’ve been running our code for lengthy periods, I probably won’t be able to provide a hard MTBF number from our experiences. I expect the best I’ll be able to provide is anecdotal accounts, since we’re not in direct control of the cluster hardware/software environment, and the system we’re using here at Illinois is shared by many other people. We’re getting our own small 8-node cluster up and running now, but that’s so much smaller that I’ll be surprised if anything occurs on it, except maybe a few cosmic-ray hits to global memory over a month-long period, and even then they may not affect our runs. We’ll just have to see how it goes. I’ll report our experiences once we’ve been at the large runs for a while.
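(If we did want to catch such hits, a crude check would be to park a known bit pattern in otherwise unused global memory and re-read it periodically. The sketch below is only an illustration of the idea, not code we actually run:)

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* 64 MB test region that no kernel ever touches. */
    const size_t n = (64 * 1024 * 1024) / sizeof(unsigned int);
    const unsigned int pattern = 0xAAAAAAAAu;
    unsigned int *h_buf = (unsigned int *) malloc(n * sizeof(unsigned int));
    unsigned int *d_buf = NULL;
    size_t i, flips = 0;

    cudaMalloc((void **) &d_buf, n * sizeof(unsigned int));

    /* Fill the region with a known pattern. */
    for (i = 0; i < n; i++)
        h_buf[i] = pattern;
    cudaMemcpy(d_buf, h_buf, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    /* In a long-running job this readback would be repeated every few hours. */
    cudaMemcpy(h_buf, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (i = 0; i < n; i++)
        if (h_buf[i] != pattern)
            flips++;

    printf("bit-flip check: %lu corrupted words out of %lu\n",
           (unsigned long) flips, (unsigned long) n);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}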

Cheers,
John

How are you using the cluster of GPUs? Are they running kernels 24 hours a day, seven days a week?

Not quite yet, but that’s where things are headed. The biggest clusters we have access to are shared with many other people, so we can’t hog the resource too much, but some of the jobs will soon be running 24/7, though perhaps on only half of the nodes. Another big GPU cluster we have access to is only available for nighttime runs, so we might run on as many as 80 GPUs on that system, but likely limited to 12 hours per day. I’ll be able to say more once the first big jobs are set up and runs commence on these systems. We’re still a couple of weeks from starting, with just a few code tidbits to deal with…

John

The devil is, as always, in the details ;)

Very much looking forward to hearing about your experiences. For my application it would be one GPU running 24/7 for months in a row, as far as I can see now, and at least four of these systems would be deployed. The GPU will probably never be running at 100%, only in extreme circumstances, but that is what we have to design for.

I will be deploying systems somewhat like yours, Denis. Each system contains four Tesla GPUs, could be left on for long periods of time (depending on the customer’s particular needs), and won’t be running at 100% all the time.

So far, my prototype system with a Tesla S870 has stayed up for several weeks at a time while running the GPUs at 100% at various times during the day. I have found the same thing John has: reliability has been limited more by software than by hardware.

In my design, I explicitly free all GPU resources and call cudaThreadExit() on every GPU after each run completes. Maybe it’s overkill, but I wanted to be sure I wouldn’t be bitten by a memory leak or other bug after a few weeks or months of uptime.
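Roughly, the per-run teardown looks like the sketch below. The kernel and buffer names are placeholders for illustration, not my actual application, and in the real multi-GPU case each host thread does this for the device it selected with cudaSetDevice():

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Placeholder kernel standing in for the real computation. */
__global__ void compute(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

/* One complete "run": allocate, launch, copy the result back, then
 * release every allocation and tear the context down so nothing can
 * accumulate across weeks of repeated runs. */
int do_one_run(int n)
{
    float *h_data = (float *) malloc(n * sizeof(float));
    float *d_data = NULL;
    int i;

    for (i = 0; i < n; i++)
        h_data[i] = (float) i;

    if (cudaMalloc((void **) &d_data, n * sizeof(float)) != cudaSuccess) {
        free(h_data);
        return -1;
    }
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    compute<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t status = cudaGetLastError();

    /* Explicitly free every device allocation... */
    cudaFree(d_data);
    free(h_data);

    /* ...and drop the context entirely between runs. */
    cudaThreadExit();

    return (status == cudaSuccess) ? 0 : -1;
}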

Sorry the info is anecdotal, but hopefully it helps somewhat.

Well, all info helps :) I wasn’t actually expecting anything better than anecdotes, unless NVIDIA happens to have numbers.

As for me, I still have to convince some people internally to go the CUDA way. A switch to CUDA is a bit dangerous this late in our development cycle, so I am trying to fill all the holes in my knowledge to be well prepared ;)

Just make sure the GPUs are adequately cooled. I have tests where a simulation ran for 150+ hours continuously in a properly cooled environment, and for less than 5 hours in a tight case with little airflow.
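If you want to keep an eye on temperatures during long runs, and assuming nvidia-smi is available on your system, a small host-side watchdog along these lines can log the thermal readings so overheating shows up before the run dies (just a sketch, not part of my test setup):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* Once a minute, append the full `nvidia-smi -q` report (which
     * includes thermal readings on supported boards) to a log file. */
    for (;;) {
        time_t now = time(NULL);
        FILE *log = fopen("gpu_status.log", "a");
        FILE *smi = popen("nvidia-smi -q", "r");

        if (log && smi) {
            char line[256];
            fprintf(log, "=== %s", ctime(&now));   /* ctime() ends with '\n' */
            while (fgets(line, sizeof(line), smi))
                fputs(line, log);
        }
        if (smi) pclose(smi);
        if (log) fclose(log);
        sleep(60);
    }
    return 0;
}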

There are hardware bugs (this is my presumption: NVIDIA won’t confirm or deny the root cause) that prevent certain kernels from running for more than a few minutes, but judging by the forum traffic over the past year, triggering this problem is very rare.

As for official MTBF numbers for the hardware: you could send a message to an NVIDIA rep like David Hoff (dhoff). NVIDIA should be able to provide these numbers, especially if it will help convince the higher-ups to make a purchasing decision :)

My 2 pixels:

For 24/7 operation over a long timespan, the Folding@Home folks have quite a bit of experience. Their cluster is based on R580s from AMD, but they’re quite content with the MTBF they see. Some of this has been written up here:

@article{Owens:2008:GC,
  author   = {John D. Owens and Mike Houston and David Luebke and Simon Green and John E. Stone and James C. Phillips},
  title    = {GPU Computing},
  journal  = {Proceedings of the IEEE},
  year     = {2008},
  month    = may,
  volume   = {96},
  number   = {5},
  pages    = {879--899},
  keywords = {GPGPU, GPU computing, parallel computing}
}

My own cluster (four 8800 GTXs and eight dual-core CPUs in four nodes) is not used 24/7, but it has been running without any hardware failures for more than a year now.

Another cluster I have worked on has been using more than 100 Quadro 1400s for over 2.5 years now, and in that time one GPU died. For reference: almost 25% of the DDR RAM died in the same period…

For the architecture-aware readers, this paper is quite interesting:

@inproceedings{Sheaffer:2007:AHR,
  author    = {Jeremy W. Sheaffer and David P. Luebke and Kevin Skadron},
  title     = {A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors},
  booktitle = {Graphics Hardware 2007},
  editor    = {Timo Aila and Mark Segal},
  year      = {2007},
  month     = aug,
  pages     = {55--64}
}

Very interesting titles and numbers; I’ll try to read the papers tomorrow at work. It confirms my feeling that if it works in tests, it will probably keep running, and other parts might fail sooner than the GPU.

@MrAnderson, I am indeed in contact with David Hoff; he is looking for some numbers. That said, numbers from others are also very valuable to me personally, especially when compared to RAM ;) I know about your experience with cooling; I have it noted in capitals to make sure everything is adequately cooled.

Is NVIDIA going to require you to sign an NDA, or is there some chance David Hoff could publish those numbers on the forum?

In my experience, even the “big iron” supercomputer vendors often make you sign NDAs to talk about such things. I think it’d be great if this info were made publicly available, but I also won’t be surprised if it isn’t. That said, I would expect software issues to be at least 10x as big a concern as hardware, at present.

John Stone

I also would not be surprised if that were the case. If you make one number public, the competition will publish a better one. Better to get a more accurate number under NDA…

Oh, thanks! Seibert too is my friend…