GTX 260 v. 8800 GT benchmarking with cublasSgemm

My new GTX 260 (EVGA superclocked) arrived today, and I was able to benchmark it against the same application on an 8800GT (1GB Palit). The application mostly uses cublasSgemm, with a few of the other cublas routines.

I was expecting something like a factor of 2 improvement in computation time (more processors, greater bandwidth), but got only about 33% improvement. Is this marginal improvement a product of the lower core clock rate of the 260? Or perhaps cublas hasn’t yet been tuned up for the GTX 260? I’ll take 33%, but I’m a bit disappointed.

My on-board networking seems to have gone out with the hardware upgrade; I suspect my power supply is marginal (although I think it meets the bare minimum requirement). The “card” (speaking loosely in calling it a card; it’s really more of a box) is indeed rather large. At 10.5" long it barely fits in my case!

You may find that you need to retune your block and thread sizes to get better performance on the G200… the extra registers and improved thread scheduler tend to make larger thread counts per block perform best on the G200.
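(For hand-written kernels, that retuning just means timing the same kernel at a few different block sizes. The little SAXPY kernel below is a made-up illustration of varying the launch configuration, not code from this thread.)

    // Toy illustration (names invented for this sketch): the same kernel
    // launched with different threads-per-block counts so each can be timed.
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void launch_with(int threadsPerBlock, int n, float a, const float *x, float *y)
    {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy_kernel<<<blocks, threadsPerBlock>>>(n, a, x, y);
    }

    // e.g. time launch_with(128, ...), launch_with(256, ...), launch_with(512, ...)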

In raw ALU power, the GTX 260 has about 40% more than the 8800GT (192 SPs @ 1.24 GHz versus 112 SPs @ 1.5 GHz)… not exactly apples to apples, but close.
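To spell that estimate out (a rough sanity check, assuming stock shader clocks and counting one MAD-capable issue per SP per clock):

    8800 GT:  112 SPs x 1.50 GHz ≈ 168 G issues/s
    GTX 260:  192 SPs x 1.24 GHz ≈ 238 G issues/s
    238 / 168 ≈ 1.42, i.e. roughly 40% more raw ALU throughput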

Your actual speedup will of course depend on where your bottlenecks are… there are a dozen possibilities ranging from PCI bandwidth to atomic instruction use to local memory bank contention to host CPU speed.

I installed a small plutonium power plant into my computer (i.e., I upgraded to 650 W), and my system recovered its networking. Ergo, a 500 W Antec Earthwatts power supply is insufficient for a GTX 260. (Luckily I’ve been doing this sort of nonsense for a long time and recognized the issue right off!)

The timing was unchanged by the power supply swap; to be specific, each step of my calculation took 0.500 s before on the 8800GT and now takes 0.375 s on the GTX 260. I spoke loosely before of expecting a factor-of-2 speed-up (50%), so a 40% speed-up is in the ballpark and not that far from 33%. I’m also willing to believe that the CUDA BLAS has not yet been fully tuned for the GTX 260/280. In calling cublasSgemm, I don’t believe I have control of grid/blocks/threads, correct?
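(For reference, a call looks something like the sketch below; this uses the legacy CUBLAS API and is illustrative only, not my actual code. Note that only device pointers, dimensions, and scalars are passed, with no grid/block parameters.)

    #include <cublas.h>  /* legacy CUBLAS API */

    /* C = A * B for n x n matrices; error checking omitted for brevity */
    void multiply(int n, const float *hA, const float *hB, float *hC)
    {
        float *dA, *dB, *dC;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);
        /* the grid/block configuration is chosen inside the library */
        cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
        cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }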

Now to keep my plutonium power plant from going critical…

It’s really true that not all PSUs are the same. The wattage rating on most PSUs seems to be marketing and not rigidly defined… see the fun but dated article http://www.tomshardware.com/reviews/stress-test,1073.html.

I put a new GTX 280 into my aging desktop (not my main machine, but the only one that could hold the large card). Its Seasonic PSU was only 500W, which is NOT enough… and worse, I played games with cable adapters to get the 8-pin supply.

Yet it worked fine, and it’s running at 100% GPU load 24/7 even now, doing some number theory searches when not rendering scenes. I’m impressed… but I was lucky.

I am planning to buy either a GTX260 or 2x8800 GT … I wonder whether my Antec Earthwatts 500W can take the load!

No, you’re mixing directions. It’s either 50%, 30%, 25% (time saved) or 100%, 40%, 33% (speedup). So although your speedup is close to the theoretical one, it’s nowhere near what you expected.
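Spelled out with the numbers from this thread:

    measured:   0.500 / 0.375 ≈ 1.33x  ->  33% faster,  25% less time
    ALU ratio:                ≈ 1.4x   ->  40% faster, ~30% less time
    hoped for:                  2x     -> 100% faster,  50% less time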

Anyway, I just came here to say we all need to quit using percents and switch to logarithms. A logarithm doesn’t change depending on which direction from 1 you’re calculating, and it’s easier to compare: a 50% speedup, for example, is not five 10% speedups, as you should know. If you use the natural logarithm times 100, it’s convenient because nothing changes for small values; a 10% speedup is approximately 10 log-percent.
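To make that concrete:

    five 10% speedups: 1.10^5 ≈ 1.61x,  i.e. 5 x 100 x ln(1.10) ≈ 48 log-percent
    one 50% speedup:   1.50x,           i.e.     100 x ln(1.50) ≈ 41 log-percent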

Watch out. You’ll stop being lucky when it finally starts to burn out, crashing your PC repeatedly until the hard drive is corrupted and the whole thing has to be reformatted. Then again, 500 W isn’t too little for an old PC with a less power-hungry CPU (it’s not a Pentium 4, is it?). It all depends on the PSU’s quality, because the 500 W number is in truth a marketing lie and means very little. (There’s no testing standard that says an X-watt PSU actually has to last N years running under the quoted load. For an unscrupulous company the number means something like “well, it doesn’t die right away at 500 W,” while for a good PSU it’s a practical figure that the engineers thought about and deemed sustainable given the components used. Either way, there’s usually not much testing.)

The 128-FMADD example provided elsewhere in the forums by NVIDIA (tweaked up to 512 threads) shows 555.85 GFLOP/s on my GTX 280 with the shader clock at 1.458 GHz. This is ~80% of the maximum theoretical rate at 2 FLOP/clock.

I think the GTX 280 is supposed to be able to issue an extra MUL per cycle as well, which leads to the headline rate of 933 GFLOP/s at the stock 1.296 GHz shader clock.
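For reference, the arithmetic behind both figures, assuming the GTX 280’s 240 SPs:

    MAD only:   240 SPs x 2 FLOP/clock x 1.458 GHz ≈ 700 GFLOP/s;  555.85 / 700 ≈ 79%
    MAD + MUL:  240 SPs x 3 FLOP/clock x 1.296 GHz ≈ 933 GFLOP/s  (the headline number)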

I don’t know how to incorporate this extra MUL into the FMADD example; perhaps an NVIDIA engineer would be kind enough to update the benchmark to push cards of this family to the limit.

That “last flop” is hard to access. It comes from the texture interpolation feature on texture reads… getting the texture hardware’s ALU involved. I don’t think anyone has actually written code that reaches that 933 GFLOP/s theoretical peak… it’s more of a marketing number than a real value. ATI, Sony, and the CPU vendors play similar theoretical games too.

On G200 the last flop is from a fused MADD+MUL. The texture interpolation “missing flops” is only on pre-G200 parts.

Presumably, one could modify the MADD synthetic benchmark to do a MADD + MUL in every instruction pair to get near the theoretical peak.
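Something along these lines might do it; this is just a sketch of the idea (not the NVIDIA forum benchmark itself), and whether the compiler and hardware actually dual-issue the extra MUL is another question:

    // Each loop iteration issues a dependent MAD followed by a MUL,
    // i.e. 3 FLOPs per pair of instructions.
    __global__ void madmul(float *out, int iters)
    {
        float a = 1.000001f * threadIdx.x;
        float b = 0.999999f;
        float c = 1.000002f;
        for (int i = 0; i < iters; ++i) {
            a = a * b + c;   // MAD: 2 FLOPs
            a = a * b;       // MUL: 1 FLOP
        }
        // write the result so the compiler can't optimize the loop away
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;
    }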