Any future problems running GPUs for 12+ hours at a time?

I use a variety. But I’d most recommend the very cheap MSI G65, which gives you three double-wide x16 PCIe slots and also includes an embedded NVIDIA display GPU, so you don’t have to waste a full GPU on driving the display. Pop in a cheap AMD hexacore and you get a really efficient system.

I also have a couple of systems based on the ASUS i7 motherboards, but I now prefer the simpler 3-GPU G65 board.

This certainly makes sense for a large deployment, or when the cost of downtime is high. For smaller-scale academic applications, the reliability features of a Tesla are seldom worth the cost by themselves. I could burn out a GTX 470 every 6 months and still come out ahead financially versus a Tesla C2050 over a 3-year deployment lifetime.

Incidentally, I totally believe the statement about one bad device taking out a large cluster. That would explain the terrible uptimes I’ve seen in most >500-node systems I’ve worked with. Really, it demonstrates that we need to come up with ways to build more fault tolerance into our distributed compute jobs. Having to assume flawless performance of thousands of devices is driving up the cost of HPC. :)

Hey, just for knowledge’s sake – how do you control GPU fan speed and temperature? I could not find any such options in the nvidia-smi tool. I am not graphics savvy… so please bear with me.

Thanks for any info,
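For monitoring (though not fan control), recent nvidia-smi builds expose temperature and fan speed through the `--query-gpu` flag. A minimal Python sketch, assuming a driver new enough to support that flag (the `parse_smi_csv` helper is just illustrative):

```python
import subprocess

def parse_smi_csv(output):
    """Parse 'nvidia-smi --query-gpu=... --format=csv,noheader,nounits'
    output into one dict per GPU."""
    rows = []
    for line in output.strip().splitlines():
        temp, fan = [field.strip() for field in line.split(",")]
        rows.append({"temp_c": int(temp), "fan_pct": int(fan)})
    return rows

def read_gpu_stats():
    # One line of CSV per installed GPU, e.g. "65, 48"
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_smi_csv(out)
```

Actually *changing* the fan speed is a different story: nvidia-smi doesn’t offer it, so you generally need a vendor utility, or on Linux the nvidia-settings tool after enabling the “Coolbits” option in the X server configuration.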

I first installed the EVGA Precision software. After starting it, I double-click the display in the middle to break it out into the larger floating display. If you then right-click on the floating display, there’s a tab for the fan profile. It’s actually pretty cool: you can click to add points and drag the line to make your own cooling curve. I set mine up with a roughly logarithmic shape, so once the temperature reaches about 65 °C the fans ramp up quickly. So far the cards never get above 72 °C, and they sit at about 65 °C roughly 90% of the time under 99% load.

Another thing I noticed: with the EE form-factor EVGA cards, case temperature doesn’t affect card temperature as dramatically, since the cards are completely enclosed and exhaust their heat out the back. Obviously, if the air the cards draw in is cooler, the cards will run slightly cooler, but I haven’t noticed as large a difference as with cards I’ve had in the past where all of the airflow stayed inside the case.
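A custom fan profile like the one described above is just a piecewise-linear map from temperature to fan duty cycle. A small sketch in Python; the control points here are illustrative, loosely matching the “ramp hard above 65 °C” shape, not the commenter’s exact curve:

```python
def fan_speed(temp_c, curve=((40, 30), (65, 45), (72, 85), (80, 100))):
    """Piecewise-linear fan curve.

    curve is a sorted sequence of (temperature °C, fan %) control
    points; temperatures outside the range clamp to the end values.
    """
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            # Linear interpolation between neighboring control points
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
    return curve[-1][1]
```

Note how the slope between 65 °C and 72 °C (about 5.7 % per degree) is much steeper than below 65 °C (0.6 % per degree), which is exactly the “ramp up quickly past 65” behavior.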

I am sceptical about higher fan speeds because of dust: more air throughput means more dust transported, so the cooling fins clog sooner.
If the fan runs slower, the “dust collection rate” is presumably lower too.
Once the cooler is completely blocked, not even 100% fan speed can cool the card under full load.

(If you can operate your cards in a dust-filtered environment, or clean them occasionally, that is not a problem, of course.)
But for an unserviced box out in the wild, I am not sure an increased fan speed will increase MTBF. (Which fails earlier on average: the electronics from temperature, or the cooler from dust?)

This is a good point. Google’s published a few papers on hard drive failure rates as a function of temperature, and they’ve found in their data centers that hard drives can tolerate higher temperatures than “conventional IT wisdom” would recommend. I’d be curious about other components as well.

Thank you, Carl! I was looking more for an NVIDIA tool that can collect these statistics… a utility for controlling and monitoring the card!

Anyway, Thanks a lot!
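If your nvidia-smi is new enough to support `--query-gpu`, it can itself serve as that statistics collector. A sketch of a periodic CSV logger built on top of it; the field names assume a reasonably recent driver, and `parse_rows` is a hypothetical helper, not part of any NVIDIA tool:

```python
import subprocess
import time

FIELDS = "timestamp,temperature.gpu,fan.speed,utilization.gpu"

def parse_rows(output):
    """Split nvidia-smi CSV output into lists of trimmed fields."""
    return [[f.strip() for f in line.split(",")]
            for line in output.strip().splitlines()]

def log_stats(path, interval_s=5, samples=10):
    """Append periodic GPU stats (one CSV row per GPU) to a file."""
    with open(path, "a") as f:
        for _ in range(samples):
            out = subprocess.check_output(
                ["nvidia-smi", f"--query-gpu={FIELDS}",
                 "--format=csv,noheader,nounits"],
                text=True,
            )
            for row in parse_rows(out):
                f.write(",".join(row) + "\n")
            f.flush()
            time.sleep(interval_s)
```

nvidia-smi also has a built-in loop flag (`-l <seconds>`), so a plain shell redirect of its output can do much the same without any Python.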
