GTX 590 CUDA power tests

Interesting product! It seems to me, though, that it’s not possible to install more than two of them back to back (at least not without custom-made plumbing), as there wouldn’t be enough space around the middle cards to attach the water pipes.

Even with 3 GTX 295s or 3 GTX 480s, air cooling still worked. I don’t know about the GTX 590 since I only have one, but its center fan and airflow design are very similar to the GTX 295’s.

But even so, I’ll be leery of water cooling for my work. I have a hard enough time when I visit a studio and say “Just plug this card into your PC”… imagine if I said “well, we need to change your PC entirely and run water pumps inside.” Businesses don’t like hardware difficulties, and getting them to plug a single GPU into a machine is hassle enough.

If I were running a giant cluster, maybe it’d be worth thinking about. But that must have issues too… the Tesla-powered Tsubame supercomputer is air cooled, so there must be practical reasons not to choose water in their design too.

I think the fans are also designed to pull air from one end of the card to the other, so blocking the middle, while limiting flow, is not a disaster. I have 4 cards (3 GTX 295 + 1 GTX 470) packed in such an arrangement with air forced into one end of the card using three 120 mm case fans at close range. The system runs hot, but not dangerously so.

Rather than pay the premium for water cooling, you could put the same money in the bank and treat it as an insurance fund for replacing a card if it breaks. :)

The HydroCopper setup from EVGA adds $150 per card, and you can put together the rest of the cooling loop for about $100. So it’s not much of an insurance fund: barely enough to replace one card.

Cooling 3 GTX 295s + 1 GTX 470 is no easy feat, but that still does not come close to the heat output of four 590s (4 × 365 W ≈ 1460 W; there are electric tea kettles in the U.S. that put out less heat than that at full load), and blowing all that heat toward the CPU and the motherboard does not seem very safe to me. I’d never dare to run that kind of setup without water cooling.

FYI: In the US, EVGA offers a free lifetime warranty on its high-end cards. To activate it, you only need to register the new card within 30 days.

This makes sense. I wasn’t aware of your business constraint.

Yes, it’s basically doing just MAD operations along with some global memory loads. It’s meant to keep the FPUs working.
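If it helps to picture it, the structure is roughly like this (my paraphrase of the idea, not the actual test code):

```cpp
// Rough sketch of a "keep the FPUs busy" kernel: a few global loads up
// front, then a long chain of dependent MADs so the arithmetic units stay
// saturated for the whole run. This is a guess at the structure, not the
// code used for the power tests.
__global__ void mad_burn(const float *in, float *out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[tid];            // global memory load
    float b = 1.000001f;
    float c = 0.000001f;
    for (int i = 0; i < iters; ++i) {
        c = a * b + c;            // these compile to MAD/FMA instructions
        b = c * a + b;
        a = b * c + a;
    }
    out[tid] = a + b + c;         // store so the compiler can't drop the loop
}
```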

Sylvain’s paper is now on my reading stack :)

Strange about the 3.1/3.2; I’ve been able to run it with 3.2 without any hitches on the 460M. Must be something with the 590, as you suggested.

Thanks for giving it a run. Running at 280 watts during such a heavy load yields 280/365 ≈ 77% of the rated peak, which is very interesting to me… What is the peak wattage during load that you have achieved so far?

I’m not sure it is wise to put in 4x GTX 590, not because of power/cooling/density issues (we actually have 4x C2070 in a machine here at work), but because of the lack of PCI-E lanes.

Nehalem only supplies 32 PCI-E 2.0 lanes (and Sandy Bridge is even worse at 16 PCI-E 2.0 lanes), and as I understand it, a large class of CUDA problems is PCI-E bus bandwidth limited.
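If you want a quick ballpark on whether your own workload is in that category, a simple pinned-memory transfer test is enough (a sketch; buffer size and formatting are arbitrary, not from any benchmark in this thread):

```cpp
// Quick-and-dirty host-to-device bandwidth check: the resource that gets
// split up when many GPUs share a limited number of PCI-E lanes.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 128 * 1024 * 1024;   // 128 MB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);    // pinned, so DMA runs at full speed
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1000.0));
    return 0;
}
```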

Wow, this is a really interesting piece of info, thanks for that. I own a GTX 590 as well and bought it primarily for work with Octane Render (a CUDA-based path tracer). I wonder now what your conclusion means for overclocking. If the card draws less power under CUDA than it does while playing games, there is a nice possibility that it could be clocked higher as well (and remain stable without being held back by the OCP). The 590’s limited overclocking potential is not a matter of weak cores (they are the same as on the 580); it’s a matter of an underpowered power cascade.

Sorry for the basic question, but I’m a beginner on the subject of multiple devices: is there any difference for a CUDA programmer between one GTX 590 3 GB and two GTX 580 1.5 GB?

No difference to the programmer. Both will show up as two distinct CUDA devices with 1.5 GB of device memory. The pair of 580s will be faster due to a higher clock rate.
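For completeness, here is a minimal sketch of how either configuration enumerates; these are just the standard runtime API calls, nothing specific to these cards:

```cpp
// Minimal sketch: list the CUDA devices the runtime sees. A single GTX 590
// shows up as two entries, just like two separate GTX 580s would.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %.1f GB, %d MHz core clock\n",
               i, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.clockRate / 1000);   // clockRate is reported in kHz
    }
    return 0;
}
```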

Thank you very much.

Oh, and the two GTX 590 devices will have to share a PCI-Express interface, which will slow down concurrent transfers to both cards. A pair of GTX 580s can be spaced out to connect to independent PCI-Express lanes, assuming the motherboard supports that. Again, this isn’t a programming difference, but just a minor performance difference.
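If you want to see the shared-interface effect yourself, something along these lines (a sketch only; buffer sizes and timing details are illustrative) issues pinned-memory copies to both devices at once, which you can compare against doing them one at a time:

```cpp
// Rough sketch: queue host-to-device copies to both devices concurrently and
// time the pair. On a GTX 590 the two devices share one PCI-E interface, so
// the combined rate should fall noticeably below twice the single-device rate.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB per device
    float *h[2], *d[2];
    cudaStream_t s[2];

    for (int i = 0; i < 2; ++i) {
        cudaSetDevice(i);
        cudaMallocHost((void**)&h[i], bytes);  // pinned, so the copies can overlap
        cudaMalloc((void**)&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; ++i) {              // queue both copies before waiting
        cudaSetDevice(i);
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(s[i]);
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("Combined H2D bandwidth: %.2f GB/s\n", 2.0 * bytes / 1e9 / sec);
    return 0;
}
```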

Just as a small followup, I tested the idle wattage of the GTX590. It’s 42 watts, measured from the wall, so allowing for PSU inefficiencies it’s probably about 30 to 35 watts.

I’d also like to chime in my thanks to SPWorley for this great research and information. Info like this is not very plentiful.

The results of the first post were encouraging, and I just pulled the trigger on a dozen 590s for twin research machines. I will be dropping six apiece into two Tyan FT77B GPU chassis! This should be interesting, and I will post some results when I have them.

Thought you might find this interesting: http://www.hpcsweden.se/files/Actual_power_consumption_in_Pattern_Matching_on_CUDA_GPUs

Power was measured with a PCI-E extender and an attached multimeter.

Well this is certainly interesting. And a bit troubling, since their efficiency scaling plot shows that their occupancy on the GPU only ever hits 40% max. This after claiming to test “extreme bandwidth and compute bound” regimes… they could have easily used the SDK nbody benchmark routine to look at full occupancy kernels.

All of the code I run is compute bound, and the kernels generally achieve over 95% occupancy on the GPU. I’m really curious about the occupancy that SPWorley’s code was giving during his tests.

Occupancy is not the same as utilization, which I think is what you are actually referring to?

And yes, they did test very compute-bound applications from the SDK, such as the n-body problem; those results are shown briefly on the left under “bandwidth and compute benchmark”. As you can see, they reach up to 9.3 GFLOPS/watt on the GTX 560, which is better than the theoretical GFLOPS/watt of that card…

The pattern matching algorithm results are shown on the right; as you can see, the power consumption there is quite constant irrespective of the increase in utilization.

And you can see that the efficiency only increases as the utilization is increased, which bodes very well for your compute bound code :)

Yes, I see what you mean, Jimmy, and yes, I do mean utilization. I should have been more precise, and I should have looked at the slide a bit longer; still, I’d like to have seen data points at those higher utilization levels too. Thanks for the reply.

My example code was close to 100% utilization… the Monte Carlo computations it’s doing are all independent. I launch a single kernel that does all the work, so that single kernel will often run for many hours at a time. Even inside the kernel, warps don’t need to sync often.

This made it a really good test, since the idle CPU means there were no launch overheads or CPU power draw beyond idle, so it was a good isolation of the GPU’s power use.
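For the curious, the pattern looks roughly like this (an independent-sample pi estimator as a stand-in for the real Monte Carlo workload; the launch sizes and device flags are just illustrative):

```cpp
// Sketch of the pattern described above: one long-running kernel of fully
// independent Monte Carlo work, with no inter-warp synchronization and the
// CPU idle for the duration. Compile with -arch=sm_20 or higher.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void monte_carlo(unsigned long long *hits, long long samples_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(1234ULL, tid, 0, &rng);

    unsigned long long local_hits = 0;
    for (long long i = 0; i < samples_per_thread; ++i) {
        float x = curand_uniform(&rng);
        float y = curand_uniform(&rng);
        if (x * x + y * y <= 1.0f) ++local_hits;
    }
    atomicAdd(hits, local_hits);   // one atomic per thread, at the very end
}

int main() {
    // Let the CPU sleep instead of spin-polling while it waits on the GPU.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    unsigned long long *d_hits;
    cudaMalloc((void**)&d_hits, sizeof(unsigned long long));
    cudaMemset(d_hits, 0, sizeof(unsigned long long));

    const int blocks = 512, threads = 256;
    const long long samples = 1LL << 20;   // crank this up for hours-long runs
    monte_carlo<<<blocks, threads>>>(d_hits, samples);
    cudaDeviceSynchronize();               // CPU stays essentially idle here

    unsigned long long h_hits = 0;
    cudaMemcpy(&h_hits, d_hits, sizeof(h_hits), cudaMemcpyDeviceToHost);
    double total = double(blocks) * threads * double(samples);
    printf("pi ~= %f\n", 4.0 * h_hits / total);
    return 0;
}
```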