Any future problems running GPUs for 12+ hours at a time?

Hello everyone. I am currently running a new EVGA GTX 470 and an older 8800 GTX in my Windows 7 system. I use the cards for BOINC and other CUDA-related tasks. The system is well ventilated, and using the EVGA tool I keep a constant eye on their temperatures. I usually run both cards at or near 100% for 12-14 hours at a time, 5-7 days a week. Both cards settle at about 80C (I’ve never seen them run hotter). I leave the fan control set to auto, and based on what the graphs show, the fans run at about 40% at those temperatures.
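
For anyone who wants to log these readings outside the EVGA tool, a minimal sketch is to poll nvidia-smi (it ships with the driver; this assumes it is on the PATH and that your driver version supports the --query-gpu option):

    # Minimal temperature/fan logger -- assumes nvidia-smi is on the PATH
    # and supports --query-gpu (newer drivers do).
    import subprocess, time

    QUERY = "temperature.gpu,fan.speed,utilization.gpu"

    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # One line per GPU, e.g. "80, 40 %, 99 %"
        for idx, line in enumerate(out.splitlines()):
            print("GPU %d: %s" % (idx, line))
        time.sleep(60)  # sample once a minute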

Will there be any problems running these cards for this long at those temperatures? It looks like the high end for these cards is around 100C. When I pump up the fan speed I can get their temps down to 50-60C even under full load. (They seem to idle at about 40-50C with no work.)

I’m guessing that running these cards at 80C for 12 hours a day, 7 days a week, should not cause any issues, but I wanted to check with the experts here first. If the temps are too high for that period of time, I can always up the fan speed. (I would rather not lower the time I process on the cards.)

Thanks in advance!

Carl

Graphics chips are designed to run correctly at high temperatures, even a little over 100C. As with all chips, electromigration can eventually cause failure, and higher temperatures accelerate that process. I believe the mean time to failure is usually measured in years when running at production settings, so even at 12 hours a day, every day, you can expect the card to last for a few years. Increasing the fan speed a little should improve the lifetime of your cards, but even with the default fan settings there shouldn’t be any issues with your card for the foreseeable future. Just don’t start trying to up the core voltage of your card or anything like that.
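
For the curious, the usual rule of thumb here is Black’s equation for electromigration, which ties mean time to failure to current density J and absolute temperature T (A, n and the activation energy E_a are empirical constants for the process):

    MTTF = A · J^{-n} · exp(E_a / (k_B · T))

Raising the temperature shrinks the exponential term, and raising the voltage pushes J up, which is why keeping both at stock matters more than the raw number of hours.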

I’ve run cards for several days at a time for weeks at 80-90C, and not seen any obvious problems. I did have a GTX 280 die, but with a sample size of only 10 cards (8800 GTX through GTX 470) and no control group, I can’t say that long usage at those temperatures has any reliability effects. You should be fine.

SPWorley has observed that GPUs running CUDA jobs usually draw less power (and therefore produce less heat) than when performing actual 3D rendering, possibly due to some areas of the GPU being inactive. It sounds like running OpenGL for several days is more dangerous than pure CUDA. :)

I did talk to a system builder who was incredulous that you’d ever run consumer cards 24/7. His point was that failures were rare, but clearly far more common when cards are stressed with 24/7 use. The professional cards (Tesla) run cooler (they have lower clocks) and have been pre-tested more thoroughly, so you’ll see many fewer failures than with consumer cards under that kind of stress.

The interesting point was the economics of board failure. One dead board will crash a production machine, perhaps just one machine in a big server farm. But now you’ve interrupted whatever computation you were running, and many throughput jobs aren’t robust to failure: they often need manual intervention and/or leave the whole farm waiting. So a single board failure can become a reliability issue for your entire farm. And that’s with a catastrophic failure… a transient failure is even worse.

And finally, if you do have a failure, you now need the skill and experience to find and replace the bad board (which seems trivial, but it’s a problem as machine counts grow!) and that replacement costs money and time.
So his final point was that it was well worth paying the significant extra cost for a Tesla for the simple reason that the higher MTBF reduced these stresses and costs.

This isn’t exactly what the original poster was asking but I mention it here because it’s interesting. Frankly I haven’t had any problem with using consumer cards 24/7. My only hardware failures were with a bad UPS and with a bad PSU.

So which do you think would cause failure first? Cranking up the fan speed to lower the core temps, which in theory could make the fans fail sooner, or letting the cores run a little hotter (currently mine are at 60C @ 3500 RPM at 70% load, which seems like the perfect spot for them) and saving the bearings in the fans?

I work for a company that needs applications to run 24/7 under high computational load for months (even years) at a time…

The only advice I can give you, if you HAVE to use a consumer card, is to underclock… heat is your biggest issue as far as the core chip goes; even if the chip is rated to run at a core temp of 140C+, you won’t want it running that hot…

Electromigration aside, the RAM, the PCB, the heatsink/thermal padding/paste, etc. are all going to have their own issues running at such high temperatures for extended periods, on top of the vibration that air cooling and a typical consumer card and case setup introduce, especially with low-quality consumer parts…

In the end though, you’ll probably find your disk drives and PSU fail before your GPU - and you tend to replace consumer cards every year or two anyway - so I’d be more worried about the rest of your system having issues, before your GPU does…

Just my 2 cents (I’m sure there’s even more to consider)

Thanks for the input. I’m running the cards now at 99% utilization for about 18 hours, then giving them a 6 hour break. The cards stabilize at about 65C with the fans at 70%. Those temps seem pretty good to me. :)

Another suggestion: NVidia’s latest Fermi cards have remarkably efficient coolers, much better than the G200 cards (which are still very good!). But the firmware uses those coolers to keep the GPUs QUIETER, not cooler. The cards really can take 90C for long periods of time, so the firmware lets the temp get that high before aggressively increasing the fan speed, and therefore the noise.

This means that if you increase airflow to your case (very important!), you often won’t see a decrease in GPU temps, because the better case airflow just lets the GPUs run their own fans even slower and quieter.

So to get lower GPU temps, the FIRST thing you need to do is edit your fan profile to aggressively push the GPUs toward a lower target temperature, perhaps 75C or so. You will likely get lower temps immediately, but you may also get a whiny shriek from the GPU fans, which are suddenly no longer silent.
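
(On Linux the same thing can be done from the command line; here’s a rough sketch, assuming Coolbits is enabled in xorg.conf, and bearing in mind that the exact nvidia-settings attribute names vary by driver version.)

    # Rough sketch: pin a fixed fan speed via nvidia-settings on Linux.
    # Assumes Coolbits is enabled in xorg.conf; attribute names vary by driver version.
    import subprocess

    def set_fan_speed(gpu, fan, percent):
        subprocess.run(
            ["nvidia-settings",
             "-a", "[gpu:%d]/GPUFanControlState=1" % gpu,            # take manual control
             "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (fan, percent)],
            check=True,
        )

    set_fan_speed(gpu=0, fan=0, percent=70)  # e.g. pin fan 0 of GPU 0 at 70%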

To get cool and quiet, you need better case cooling. You can’t have too much ventilation! The best case is a whole new discussion, but I will strongly recommend the Silverstone FT02 case, which is nearly ideal for 24/7 GPUs. It has large fans and, most of all, a rotated motherboard that lets GPU heat rise and vent out of the case.

I have two FT02 systems, each with 3 GPUs, right next to me now, both running at full load. I can hear them both but it’s not too distracting.
My fan profiles are still at the default 90C; I have lowered that in the past but finally returned to 90C for the quiet.

I also have a third system with the excellent P183 case but it’s significantly louder than the FT02, even though this system has only 2 24/7 GPUs.

“…To get cool and quiet, you need better case cooling. You can’t have too much ventilation! The best case is a whole new discussion, but I will strongly recommend the Silverstone FT02 case, which is nearly ideal for 24/7 GPUs. It has large fans and, most of all, a rotated motherboard that lets GPU heat rise and vent out of the case…”

SPWorley- Can you refresh our memory on what mobo you are using in this setup? Thanks, V.

I’m running an XPS 720 plus a few more strategically placed fans I added later. The case cools very well; my proc under full load only hits about 118-120F (roughly 48-49C). The fans on the cards are a bit loud, but not too much. I can deal with the fans at 70% to keep the cards in the low-to-mid 60C range. :)

It’s really amazing how powerful these cards are. I have my BOINC configuration feeding two work units to each card at the same time. I can knock out 8 SETI WU per hour! Just a few years ago it would take 18 HOURS to process a single WU! I guess those 448 cores really help. ;)
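
(For perspective, 8 WU per hour across two cards is about 4 WU per hour per card, versus roughly 1/18 WU per hour before: on the order of a 70x per-card throughput increase, assuming the work units are comparable.)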

I would like to get water coolers for them in the future but I have not done it before and I don’t know if that will kill the EVGA lifetime warranty by taking off the plastic EE case and putting on a large copper water block. It would be so cool though to water cool them and really crank up their power.

On a side note, which setting affects CUDA applications the most? Core, shader or memory?

I use a variety, but I’d most recommend the very cheap MSI G65, which gives you 3 double-wide x16 PCIe slots and also includes an embedded NV display GPU, so you don’t have to waste a full GPU on display duty. Pop in a cheap AMD hexacore and you get a really efficient system.

I also have a couple systems based on the ASUS i7 MBs, but I now like the easier 3-GPU G65 motherboard.