GTX 260 v. 8800 GT benchmarking with cublasSgemm

My new GTX 260 (EVGA superclocked) arrived today, and I was able to benchmark it against the same application on an 8800GT (1GB Palit). The application mostly uses cublasSgemm, with a few of the other cublas routines.

I was expecting something like a factor of 2 improvement in computation time (more processors, greater bandwidth), but got only about 33% improvement. Is this marginal improvement a product of the lower core clock rate of the 260? Or perhaps cublas hasn’t yet been tuned up for the GTX 260? I’ll take 33%, but I’m a bit disappointed.

My on-board networking seems to have gone out with the hardware upgrade; I suspect my power supply is marginal (although I think it meets the bare minimum requirement). The “card” (speaking loosely in calling it a card; it’s really more of a box) is indeed rather large. At 10.5" long it barely fits in my case!

You may find that you need to retune your block and thread sizes to get better performance on the G200… the extra registers and improved thread scheduler tend to make larger thread counts per block perform best on the G200.
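(For hand-written kernels, that retuning just means timing the same kernel at a few different block sizes. The little SAXPY kernel below is a made-up illustration of varying the launch configuration, not code from this thread.)

    // Toy illustration (names invented for this sketch): the same kernel
    // launched with different threads-per-block counts so each can be timed.
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void launch_with(int threadsPerBlock, int n, float a, const float *x, float *y)
    {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy_kernel<<<blocks, threadsPerBlock>>>(n, a, x, y);
    }

    // e.g. time launch_with(128, ...), launch_with(256, ...), launch_with(512, ...)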

In raw ALU power, the GTX 260 has about 40% more than the 8800GT (192 SPs @ 1.24 GHz versus 112 SPs @ 1.5 GHz)… not exactly apples to apples, but close.
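To spell that estimate out (a rough sanity check, assuming stock shader clocks and counting one MAD-capable issue per SP per clock):

    8800 GT:  112 SPs x 1.50 GHz ≈ 168 G issues/s
    GTX 260:  192 SPs x 1.24 GHz ≈ 238 G issues/s
    238 / 168 ≈ 1.42, i.e. roughly 40% more raw ALU throughput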

Your actual speedup will of course depend on where your bottlenecks are… there are a dozen possibilities ranging from PCI bandwidth to atomic instruction use to local memory bank contention to host CPU speed.

I installed a small plutonium power plant into my computer (i.e., I upgraded to 650 W), and my system recovered its networking. Ergo, a 500 W Antec Earthwatts power supply is insufficient for a GTX 260. (Luckily I’ve been doing this sort of nonsense for a long time and recognized the issue right off!)

The timing was unchanged by the power supply swap; to be specific, each step of my calculation took 0.500 s before on the 8800GT and now takes 0.375 s on the GTX 260. I spoke loosely before of expecting a factor-of-2 speed-up (50%), so a 40% speed-up is in the ballpark and not that far from 33%. I’m also willing to believe that the CUDA BLAS has not yet been fully tuned for the GTX 260/280. In calling cublasSgemm, I don’t believe I have control of grid/blocks/threads, correct?
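(For reference, a call looks something like the sketch below; this uses the legacy CUBLAS API and is illustrative only, not my actual code. Note that only device pointers, dimensions, and scalars are passed, with no grid/block parameters.)

    #include <cublas.h>  /* legacy CUBLAS API */

    /* C = A * B for n x n matrices; error checking omitted for brevity */
    void multiply(int n, const float *hA, const float *hB, float *hC)
    {
        float *dA, *dB, *dC;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);
        /* the grid/block configuration is chosen inside the library */
        cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
        cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }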

Now to keep my plutonium power plant from going critical…

It’s really true that not all PSUs are the same. The wattage rating on most PSUs seems to be marketing and not rigidly defined… see the fun but dated article http://www.tomshardware.com/reviews/stress-test,1073.html.

I put a new GTX 280 into my aging desktop (not my main machine, but the only one that could hold the large card). Its Seasonic PSU was only 500W, which is NOT enough… and worse, I played games with cable adapters to get the 8-pin supply.

Yet it worked fine, and it’s running at 100% GPU load 24/7 even now, doing some number theory searches when not rendering scenes. I’m impressed… but I was lucky.

I am planning to buy either a GTX260 or 2x8800 GT … I wonder whether my Antec Earthwatts 500W can take the load!

No, you’re mixing directions. It’s either 50%, 30%, 25% (time saved) or 100%, 40%, 33% (speedup). So although your speedup is close to the theoretical one, it’s nowhere near what you expected.
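Spelled out with the numbers from this thread:

    measured:   0.500 / 0.375 ≈ 1.33x  ->  33% faster,  25% less time
    ALU ratio:                ≈ 1.4x   ->  40% faster, ~30% less time
    hoped for:                  2x     -> 100% faster,  50% less time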

Anyway, I just came here to say we all need to quit using percents and switch to logarithms. A logarithm doesn’t change depending on which direction from 1 you’re calculating, and it’s easier to compare: a 50% speedup, for example, is not five 10% speedups, as you should know. If you use the natural logarithm times 100, it’s convenient because nothing changes for small values; a 10% speedup is approximately 10 log-percent.
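To make that concrete:

    five 10% speedups: 1.10^5 ≈ 1.61x,  i.e. 5 x 100 x ln(1.10) ≈ 48 log-percent
    one 50% speedup:   1.50x,           i.e.     100 x ln(1.50) ≈ 41 log-percent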

Watch out. You’ll stop being lucky when it finally starts to burn out, crashing your PC repeatedly until the hard drive is corrupted and the whole thing has to be reformatted. Then again, 500 W isn’t too little for an old PC with a less power-hungry CPU (it’s not a Pentium 4, is it?). It all depends on the PSU’s quality, because the 500 W number is in truth a marketing lie and means very little. (There’s no testing standard that says an X-watt PSU actually has to last N years running under the quoted load. For an unscrupulous company the number means something like “well, it doesn’t die right away at 500 W,” while for a good PSU it’s a practical figure that the engineers thought about and deemed sustainable given the components used. Either way, there’s usually not much testing.)

The 128-FMADD example provided elsewhere in the forums by NVIDIA (tweaked up to 512 threads) shows 555.85 GFLOP/s on my GTX 280 with the shader clock at 1.458 GHz. This is ~80% of the maximum theoretical rate at 2 FLOP/clock.

I think the GTX 280 is supposed to be able to issue an extra MUL per cycle as well, which leads to the headline rate of 933 GFLOP/s at the stock 1.296 GHz shader clock.
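For reference, the arithmetic behind both figures, assuming the GTX 280’s 240 SPs:

    MAD only:   240 SPs x 2 FLOP/clock x 1.458 GHz ≈ 700 GFLOP/s;  555.85 / 700 ≈ 79%
    MAD + MUL:  240 SPs x 3 FLOP/clock x 1.296 GHz ≈ 933 GFLOP/s  (the headline number)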

I don’t know how to incorporate this extra MUL into the FMADD example; perhaps an NVIDIA engineer would be kind enough to update the benchmark to push cards of this family to the limit.

That “last flop” is hard to access. It comes from the texture interpolation feature on texture reads… getting the texture hardware’s ALU involved. I don’t think anyone has actually written code that reaches that 933 GFLOP/s theoretical peak… it’s more of a marketing number than a real value. ATI, Sony, and the CPU vendors play similar theoretical games too.

On G200 the last flop is from a fused MADD+MUL. The texture interpolation “missing flops” is only on pre-G200 parts.

Presumably, one could modify the MADD synthetic benchmark to do a MADD + MUL in every instruction pair to get near the theoretical peak.
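Something along these lines might do it; this is just a sketch of the idea (not the NVIDIA forum benchmark itself), and whether the compiler and hardware actually dual-issue the extra MUL is another question:

    // Each loop iteration issues a dependent MAD followed by a MUL,
    // i.e. 3 FLOPs per pair of instructions.
    __global__ void madmul(float *out, int iters)
    {
        float a = 1.000001f * threadIdx.x;
        float b = 0.999999f;
        float c = 1.000002f;
        for (int i = 0; i < iters; ++i) {
            a = a * b + c;   // MAD: 2 FLOPs
            a = a * b;       // MUL: 1 FLOP
        }
        // write the result so the compiler can't optimize the loop away
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;
    }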