GTX 590 CUDA power tests

The new dual-GPU GTX590 is still hard to find, but I did finally get one, and I’ll be ordering 4-5 more. My software scales well over multiple GPUs, and it worked well with the previous dual-GPU GT200 card, the GTX295. The specs of the GTX590 are also known, so I didn’t have any doubt about performance or compatibility.

The biggest question I did have was about power use… the GTX590 has a PCB design flaw which makes it extremely easy to fry. That typically only happens when the crazy kids crank voltages too high, but the GTX590 is notable for its unprecedentedly high wattage and failure rate.

Still, as I’ve posted about before, we CUDA guys don’t really need to worry. CUDA apps use significantly lower wattage than graphics apps, so there should be plenty of wattage headroom.
The exact wattage used in CUDA apps varies of course, but as a rough approximation, a fully loaded CUDA app uses about 65% of the wattage of a fully loaded graphics app. All the reviews you ever read are for and by the graphics guys who are optimizing frames-per-second in the latest eye-candy shooting games.

Still, those fried-card reports suggest the GTX590 is likely the least robust GPU around. Is that OK for CUDA? Do we actually hit any harsh limits?

Well, that’s what I measured, of course. I have a system power meter (a Kill-A-Watt), so I can measure instantaneous full-system wall-socket power draw. This doesn’t give you exact GPU wattages, but in some ways it’s even more useful, since you can tell if your PSU is overloaded or whatever.

My benchmark (for both power and speed) is one of my own applications… a Monte Carlo integral equation computation over a large geometric database. One GTX480 runs my GPU app about 25 times faster than an optimized one-core CPU version. The computation usually takes many hours to run on the GPU, and my code is multi-GPU aware, so it’s common to run 3 GPUs for 4- or 8-hour runs straight. The code is entirely GPU based, so the CPU is idle during computation (other than dealing with zero-copy memory and minor kernel scheduling).
This is pretty much the most severe app I can run in CUDA… the GPU is going full speed for hours (or days!) at a time with no rest or mercy.
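(For the curious, the shape of that kind of workload is roughly a “persistent kernel”: one launch that spins until the host tells it to stop via a zero-copy flag. Here’s a stripped-down sketch of the pattern, not my actual code; the work loop and launch dimensions are just placeholders.)

// Stripped-down sketch of a persistent kernel with a zero-copy stop flag.
// Illustration of the pattern only; the "work" here is a placeholder loop.
// Note: run this on a non-display GPU, or the driver watchdog will kill it.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistentKernel(volatile int *stopFlag, float *accum)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = tid * 0.001f;
    while (*stopFlag == 0) {                 // poll the host-written flag (zero-copy)
        for (int i = 0; i < 100000; ++i)     // the real work would go here
            x = x * 1.000001f + 0.5f;
    }
    accum[tid] = x;                          // keep the compiler from dropping the loop
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation

    int *h_stop, *d_stop;
    cudaHostAlloc((void **)&h_stop, sizeof(int), cudaHostAllocMapped);
    *h_stop = 0;
    cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);

    float *d_accum;
    cudaMalloc((void **)&d_accum, 64 * 256 * sizeof(float));

    persistentKernel<<<64, 256>>>(d_stop, d_accum);

    // ... hours later, when the host decides the run is done:
    *h_stop = 1;
    cudaDeviceSynchronize();
    printf("kernel finished\n");
    return 0;
}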

I’m testing on Linux (Ubuntu 10.10) using an older driver, 260.19.36. There are newer drivers, but they affect my video output pretty badly, so I stuck with what worked. That said, there were some issues with the GPUs staying above idle clocks long after the compute had finished… those may or may not be fixed in future driver updates. But for wattage at idle and at full load, I don’t expect the values to change. (There is a strong reason to use a more updated driver for power limiting, but again that’s for the crazy FurMark guys who like to stress VRMs.)

My system specs shouldn’t matter too much, since we’ll be comparing the same machine after swapping out a single GTX480 for the new GTX590.
The motherboard is an ASUS P7T Revolution, the CPU is an i7-980X, and there’s 12GB of RAM. A GT240 GPU handles display only.
In the first configuration I test a GTX480; then I swap in the GTX590 and run first with one of its GPUs active, then with both.

Wattages are measured at the wall socket, so actual component power use is lower (the PSU is probably about 15% inefficient).
Remember these are wattages for the entire system, so they include CPU, drives, display card, everything. But we can easily see the GPU differences under load (where the GPU is the only thing that changes in the system load).

Wattage of the GPUs typically increases with time, likely from the GPUs heating up… the GPUs themselves seem to use more power when hot (and the fan likely uses a few watts).

Base run: GTX 480.
Initial idle: 181 watts. GPU at 37 degrees C.
At application start: 308 watts. GPU at 39 degrees C.
At application end: 331 watts. GPU at 66 degrees C.
GPU run time: 502 seconds.

GTX 590, using just one of the onboard GPUs.
Initial idle: 186 watts. GPUs at 36 and 34 degrees C.
Application start: 285 watts. 40 and 34 degrees C.
Application end: 310 watts. 77 and 36 degrees C.
Run time: 542 seconds.

GTX 590, using both onboard GPUs simultaneously.
Idle: 186 watts. GPUs at 36 and 34 degrees C.
Application start: 400 watts. 40 and 40 degrees C.
Application end: 428 watts. 80 and 75 degrees C.
Run time: 272 seconds.

Conclusions:

Speed-wise, the GTX590 is just as fast in CUDA as you’d expect from its core count and clock rates.
The GTX480 has 480 cores at 1.40 GHz; the GTX590 has two GPUs, each with 512 cores at 1.26 GHz (for the EVGA Classified GTX590 I have).
The run times I got show about the same ratio… the GTX590 is roughly 8% slower than the GTX480 (from cores and clocks alone you’d expect around 4-5% slower, but the GTX590’s RAM is slower too).
As expected, using both GPUs of the GTX590 is twice as fast as a single GPU. That’s more a measure of my CUDA code’s efficiency, but it’s good to see there’s no unexpected hardware slowdown.
These speed measurements are not a surprise.
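As a rough sanity check using those numbers (treating throughput as simply cores × clock):

480 cores × 1.40 GHz = 672
512 cores × 1.26 GHz ≈ 645
672 / 645 ≈ 1.04 (about 4-5% expected slowdown)
542 s / 502 s ≈ 1.08 (about 8% measured slowdown, with the slower RAM likely making up the difference)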

Power use was the big unknown and the results are quite pleasant. We CUDA folk can smile and feel smug.
Idle power wasn’t measured directly, but it’s still quite low, only 5 watts more at the wall than with the GTX480. Loaded, a single GPU of the GTX590 consumes roughly the same wattage as the GTX480.
Using both GPUs at full stress draws about 242 watts more than the system at idle. Assuming the card idles at about 30 watts (I did not measure this), the full GTX590 is using about 275 watts at full load with both GPUs running. Even that value is an overestimate since it’s wall-socket power, so actual GPU power is likely somewhere around 230-250 watts at full load. That’s a very large and comfortable margin compared to the 375-watt design limit set by the card’s maximum PCIe power input.
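Spelling that arithmetic out (the 30-watt idle figure is an assumption, not a measurement):

428 W - 186 W = 242 W (full-load delta over system idle)
242 W + ~30 W assumed card idle ≈ 272 W attributed to the card at the wall
minus ~10-15% PSU loss → roughly 230-250 W at the card itself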

Summary: the dual-GPU GTX590 works in CUDA as expected. It’s great. For CUDA apps the card won’t use more than about 275 watts even under intense and sustained loads.

I can’t imagine why this would be. What’s so different about CUDA that all shaders running at full power would consume only 65% of the wattage of a video game? Video games do tend to make full use of bilinear interpolation in the texturing units, which are normally idle in CUDA, but would that lower the power draw by 35%?

I need to run some tests with my card.

I can’t imagine why this would be, either. Isn’t the 590 just two downclocked 580s? In that case, there should be a lot of spare OC capacity. I’d expect it to OC the core to 800 MHz at stock voltage without breaking a sweat. Unless it has a deficient cooler design (spreading 400 watts is not an easy task).

It’s likely the rasterizers, which are completely unused in CUDA. The FurMark torture test is clearly the harshest way to make a GPU heat up, and it basically consists of throwing as many tiny polygons at the GPU as fast as possible. There’s little or no shading or texturing.

The GTX590 doesn’t overclock well, not because of the GPU chips, but because of the PCB design. The GPUs work great! What’s failing is the PCB’s VRMs (the small square chips on the board that convert the 12V input down to the roughly 1V the GPU needs), which can’t handle the power draw. The GTX590 has fewer VRMs per GPU than the GTX580, likely for simple space reasons on the PCB. The overclocking guys use higher voltage and wattage, putting more stress on the VRMs. With only 5 VRMs per GPU on the GTX590, they blow, often with smoke.

The GTX590 has 5 VRMs per GPU, each rated at 35 amps (meaning 35 watts at 1.0 volt), so 10 VRMs total and a 350-watt maximum. The stock GTX580 has 8 VRMs, and some custom PCB designs have up to 16 (though I don’t know their current ratings).

In general, more VRMs give “smoother” power, but the power limit is defined by the product of the VRM count, VRM current rating, and voltage.
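For the GTX590 numbers above, that works out to:

power limit = VRM count × current rating × voltage = 10 × 35 A × 1.0 V = 350 W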

Very interesting.

Any observations on the GPU-to-GPU bandwidth within the GTX 590? Is it getting the full 16-lane PCIe 2.0 bandwidth through the onboard switch, or less?

I think testing this may require the CUDA 4.0 P2P features to be working, so that’s a good question as well: does P2P work?
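Something like this minimal check would probe both at once (a sketch against the CUDA 4.0 runtime API; the device indices and transfer size are just assumptions, and error checking is omitted):

// Sketch: check whether the two GPUs on the GTX590 can do peer-to-peer access,
// then time a device-to-device copy between them. Assumes the card's GPUs are
// devices 0 and 1; adjust as needed.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 * 1024 * 1024;   // 256 MB test transfer
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc((void **)&buf0, bytes);
    cudaSetDevice(1); cudaMalloc((void **)&buf1, bytes);

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    cudaSetDevice(0);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // let device 0 map device 1's memory
    cudaSetDevice(1);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);

    // Time GPU0 -> GPU1. cudaMemcpyPeer stages through the host when P2P is off,
    // so the measured bandwidth also tells you whether P2P is really in effect.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));
    return 0;
}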

Yes, interesting. Those interested in a 590 should probably wait for custom designs with more VRMs.

In addition to the rasterizers, there’s also something called the “PolyMorph engine,” and it seems to be unused in CUDA too. I looked through online sources; no one says the raster engines and PolyMorph engines take up any significant part of the chip, but no one denies it, either.

My conclusion is exactly the opposite. For multi-GPU CUDA apps, the GTX590 is terrific as-is. Those power measurements are direct evidence that CUDA workloads never come near the wattage levels where the GTX590’s PCB VRM issues kick in.

For gaming? I’d probably skip the GTX590, but not for CUDA.

I’m ordering many more boards. They’re just hard to get now.

Hi SPWorley,
Sorry for the stupid question, but how did you manage to run CUDA on the GTX590 with those old drivers, which don’t support the GTX590?
I ask because I have a problem running two of them with the 270.40 and 270.41 drivers under Linux: they work ONLY at a maximum of 553 MHz (performance level 2).

I used the 270.36 drivers since the 270.40 corrupted my video (even with the GTX 480). Those are the CUDA 4.0 RC1 drivers.

In .36 the GTX590 is identified as “Device Emulation (CPU)” by CUDA properties, but it works in CUDA apps.

nvidia-settings just labels it as “unknown”.

There is definitely some runlevel problem, though. Both GPUs start at runlevel 0 (super-idle). When given a load, both properly go up to runlevel 3 (1260 MHz) and speed timing shows they’re really at that speed.

After the load is finished, the runlevels drop, but sometimes only to runlevel 1 or 2, even after 20 minutes. That’s obviously a bug, but again I’m not concerned by it since I’m using old drivers.

The other runlevel bug occurs when I put a load on one GPU but not the other… often the second GPU is also affected, and it’s not symmetric. (Load on GPU 0 doesn’t affect GPU1, load on GPU1 will boost GPU0 to runlevel 2 and it will stick there.)
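(By “speed timing” I just mean a crude clock()-based check, something along these lines. It’s a rough sketch, not my exact code, and I’m assuming the on-chip counter ticks at the shader clock; if it ticks in a different clock domain on your chip, scale the result accordingly.)

// Rough sketch: estimate the SM clock by comparing the on-chip cycle counter
// against wall-clock time for a long-running single-thread kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void clockProbe(long long *ticks, float *sink)
{
    long long t0 = clock64();
    float x = 1.0f;
    for (int i = 0; i < 50000000; ++i)       // dependent arithmetic chain, just to burn time
        x = x * 1.0000001f + 1.0e-7f;
    long long t1 = clock64();
    *ticks = t1 - t0;
    *sink = x;                               // keep the loop from being optimized away
}

int main()
{
    long long *d_ticks; float *d_sink;
    cudaMalloc((void **)&d_ticks, sizeof(long long));
    cudaMalloc((void **)&d_sink, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    clockProbe<<<1, 1>>>(d_ticks, d_sink);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    long long ticks = 0;
    cudaMemcpy(&ticks, d_ticks, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("Estimated clock: %.0f MHz\n", ticks / (ms * 1000.0));  // ticks/ms / 1000 = MHz
    return 0;
}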

There are still no certified Linux drivers for the GTX590, not even beta ones. 260.19.44 came out on March 7.

Thanks for the answer! I will try those drivers too, but I suppose in that case I won’t have fan control. Actually I only have the 270.35 drivers from RC1, but presumably they’ll behave the same.

The 270.40-41 drivers work well with a single GTX590: full speed and no issues with the performance levels. But if I use two cards (4 GPUs) then I can’t reach level 3, i.e. both are limited for some reason to a 553 MHz core speed.

I don’t know what the problem could be other than a driver issue… if someone has other ideas I’d be thankful.

It is sad that there are no official drivers, not even beta ones. I have had the cards since the 29th of March, but haven’t yet been able to test them adequately…

Thanks for posting these results. I have done some Kill-A-Watt measurements of my own on a quad-480 box I have. One thing I have found is that the power measurements vary hugely depending on which CUDA application you run. One of my applications is a cross-correlation computation used for real-time astronomy signal processing. It achieves 79% of peak performance on a 480. Would you be willing to benchmark this application running on your 590s and report the power usage? This application is multi-GPU aware, and scales trivially with the number of processors.

FYI: radio astronomy is an extremely power constrained application, where one really cares about power usage because of the typical remoteness of the observing telescope (see one of our previous papers for an example of how we do supercomputing on a diesel generator).

Hmmm, I’ve written some simple apps that achieve up to 95-98% of peak on my 460M laptop. Perhaps I could post these and you could give me your peak Kill-A-Watt power readings?

Sure, I’ll be happy to run power tests as long as they compile and run on Linux. Just give me instructions. I agree, it’d be nice to see whether different workloads have different power profiles in CUDA. Who knows, maybe heavy transcendentals draw different power than multiplies, which differ from shared memory writes, which differ from global memory copies. They probably are different (especially memory), but I suspect it’s not a big difference.
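If anyone wants to try this at home, the kind of thing I have in mind is a pair of deliberately lopsided kernels, one arithmetic-bound and one bandwidth-bound, each run alone while you watch the wall meter. Just a sketch of the idea, untested; the launch shapes and repetition counts are placeholders you’d tune so each phase runs a minute or two:

// Sketch: two lopsided kernels for comparing power draw of pure arithmetic vs.
// streaming global memory traffic. Note the Kill-A-Watt reading during each phase.
#include <cuda_runtime.h>

__global__ void fmaBurner(float *out, int iters)
{
    float a = threadIdx.x * 0.001f;
    float b = 1.0001f, c = 0.0001f;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;                       // long dependent FMA chain, no memory traffic
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;   // prevent dead-code elimination
}

__global__ void memBurner(float *dst, const float *src, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i] + 1.0f;              // streaming global reads and writes
}

int main()
{
    const size_t n = 64 * 1024 * 1024;       // 64M floats = 256 MB per buffer
    float *out, *src, *dst;
    cudaMalloc((void **)&out, 128 * 256 * sizeof(float));
    cudaMalloc((void **)&src, n * sizeof(float));
    cudaMalloc((void **)&dst, n * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));

    // Phase 1: arithmetic only. Watch the meter.
    for (int rep = 0; rep < 2000; ++rep)
        fmaBurner<<<128, 256>>>(out, 1000000);
    cudaDeviceSynchronize();

    // Phase 2: memory streaming only. Watch the meter again.
    for (int rep = 0; rep < 20000; ++rep)
        memBurner<<<256, 256>>>(dst, src, n);
    cudaDeviceSynchronize();
    return 0;
}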

Email me at gpu_power@jamasaru.com (an email address I just set up for this, so when spammers scrape it from this forum, it won’t matter.)

I predict ahead of time that if the GPU is at 100%, the main difference in power use will be whether you’re using the CPU at the same time or not. Let’s test that prediction!

And finally, I’d recommend that anyone doing CUDA development buy a Kill-A-Watt. It’s less than $25, and even though you’ll only use it a few times a year, it’s still invaluable!

Our new computer with four GTX 580s will arrive soon. I was thinking of ordering GTX 590s, but I was afraid of the power problem, so I went for the GTX 580 instead. If a GTX 590 consumes 275 W max in CUDA, it means that with a 1500 W power supply you could use four of them.
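Roughly:

4 × 275 W ≈ 1100 W for the four cards, leaving about 400 W of a 1500 W supply for the CPU, drives, and everything else (plus whatever margin you want for PSU efficiency).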

When I stacked GTX480s, the temperature went to 93 degrees and the noise was too loud. With the GTX 580s the temperature is 87 degrees and the noise is no longer a problem. When you get the other GTX 590 boards, please report the temperatures if you stack them. I remember reading that for gaming this isn’t possible because the cards heat up too much.

What program are you using to measure the load on the GPU?

In the power measurements that we did some time ago (paper here), global memory accesses accounted for most of the power consumption. The kind of arithmetic instruction executed did not matter as much as its throughput (so register-to-register MOVs were burning more power than MADs, since they can run on both execution pipelines).

And THAT’s the kind of research that’s great to see! I like how you went down to the instruction level. An interesting follow-up is looking at it at the app level (which you covered by doing classic BLAS multiplies), but for more apps, ranging from Folding@home to HOOMD to ray tracing, etc.

Fermi’s behavior may also be different, partly because of the hardware itself, but perhaps more because of the caches, which should reduce global memory traffic.

Sylvain, thanks for the paper, and for that work you did! So our app-level power tests will be follow-ups to that initial research of yours.

Sure! But I haven’t been able to get any more boards yet, so it may be a while.

I am not measuring my GPU load at all. I know it’s 100% since my kernel itself is a persistent kernel. A single call runs for hours, so it’s always computing with no launch or queuing overhead or anything.

Just because your kernel is running doesn’t mean it’s running at 100%; it could be stalled for any number of reasons (which, incidentally, a GPU usage monitor wouldn’t show accurately either).

As someone said earlier, register-to-register MOV instructions are likely the worst power consumers, so that’s probably the worst-case scenario for CUDA. Someone should write a CUDA kernel that does nothing but MOVs to gather power consumption numbers.
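Something along these lines might do it, using inline PTX so the compiler can’t eliminate the MOVs (just a sketch, unmeasured; ptxas may still coalesce some of them, so it’s worth checking the SASS with cuobjdump):

// Sketch: a kernel that is mostly register-to-register MOVs, via inline PTX so the
// compiler keeps them. Intended purely as a power-draw probe, not useful work.
__global__ void movBurner(float *out, int iters)
{
    float a = threadIdx.x * 1.0f;
    float b = a + 1.0f;
    for (int i = 0; i < iters; ++i) {
        asm volatile ("mov.f32 %0, %1;" : "=f"(a) : "f"(b));
        asm volatile ("mov.f32 %0, %1;" : "=f"(b) : "f"(a));
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;   // keep the results live
}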

I should also note that it isn’t necessarily the VRMs that are blowing up on the GTX590s. That’s widely believed because the same “snap, crackle, pop” happened with the GTX570 (which did actually suffer VRM failures).

Nonetheless, if you don’t overclock either model of card, something blowing up is unlikely.

I sent you some code at the above address. Let me know how it runs :)

No problem. I ran it on one of the two GPUs on the board.

I added a UNIX timer to the code. I boosted the internal iterations from 100 to 1000 just so it’d run a little longer and the power would stabilize. (That didn’t change your program’s output).

Results under both sm_13 and sm_20 were pretty much the same, maybe sm_13 was 1% faster… within measurement noise.

Output for sm_20 was:

Time: 25.580 ms
Performance: 1259.275 GFLOPS
Bandwidth: 20.988 GB/s
Error: no error

Under sm_13:

Time: 25.300 ms
Performance: 1273.212 GFLOPS
Bandwidth: 21.220 GB/s

Idle watts: 190. Load watts: 280.

So your app’s wattage over idle was 90 watts.

Your app is also spinlocking the CPU, so some of that power is CPU use.

In comparison, my app used 100 watts with no CPU load.

The test kernel Jimmy sent me is mostly a tight loop doing some floating point accumulation.

My armchair analysis is that the kernel is dominated by MADs, not memory traffic. As Sylvain showed us, those are lower power than memory access operations.

Big caveat: I used toolkit 3.1 for this (not even 3.2), since 3.2 doesn’t seem to like this (simple!) code with these .36 drivers and the GTX590. I’d get an unknown error when mallocing… strange. But

SPWorley,

I find your numbers and conclusions quite interesting and useful. Thanks for conducting this research and posting the results!

I have a question about the choice of hardware you made, from the power consumption and dissipation perspective. I have no experience with dual-GPU cards.

I would think that if one wants to plug as many Fermi cards into the system as possible, they would use water-cooled cards, such as the EVGA GeForce GTX 590 Classified Hydro Copper Quad SLI (2-pack). I presume that in such a setup it’s possible to have four water-cooled GTX590s sitting adjacent to each other, provided the motherboard offers enough PCI-E slots (well, EVGA only sells two per household, but let’s skip that issue for now). I also presume that placing air-cooled GTX590s next to each other might fry them, since the fan is in the middle of the card and hence unable to pull in enough air for cooling.

Of course, the above setup involves the complexity of water cooling, and most likely two power supplies. But if your application is light on CPU and heavy on GPU, that might still be cost-effective, especially if you factor in the cost of human waiting time.

Are there any reasons you chose not to go with the water-cooled setup?

Thanks!