Performance Difference between 8600GT and GTX280

kjulius · January 15, 2009, 2:18pm

Hi,

I am working on a Monte Carlo simulation for a financial problem. It is classified as an ‘embarrassingly parallel’ problem. All the thread
processors run the same code without communication among them, except at the last stage to add up the results.

My implementation, using CUDA and running on a GeForce 8600GT with 32 thread processors, is about 3 times faster than
on an AMD 64 3200+ 2GHz CPU.
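
As a rough illustration of the structure described above (independent workers, one reduction at the end), here is a minimal CPU-side Python sketch. The payoff (a European call under geometric Brownian motion), the parameter values, and the names `simulate_worker` and `price_call` are assumptions for illustration only, not the poster's actual code.

```python
import math
import random

def simulate_worker(seed, n_paths, s0, strike, rate, vol, maturity):
    """Sum of discounted call payoffs for one independent worker (hypothetical helper)."""
    rng = random.Random(seed)  # each worker owns its own random stream
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    total = 0.0
    for _ in range(n_paths):
        terminal = s0 * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += discount * max(terminal - strike, 0.0)
    return total

def price_call(n_workers, paths_per_worker, **params):
    # Workers never communicate; on a GPU each one would map to a thread.
    partial_sums = [simulate_worker(seed, paths_per_worker, **params)
                    for seed in range(n_workers)]
    # The only shared step: add up the partial results at the very end.
    return sum(partial_sums) / (n_workers * paths_per_worker)

# Assumed example parameters, chosen only to make the sketch runnable.
estimate = price_call(32, 2000, s0=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
print(f"Monte Carlo estimate: {estimate:.3f}")
```

Because the workers share nothing until the final sum, the work per worker stays the same whether there are 32 or 240 of them, which is why this kind of problem tends to scale well with processor count.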

Questions:

1. If I use a GTX280 with 240 thread processors, higher clock speeds and faster memory according to the spec sheet, how much faster
would it run than the 8600GT? More than 10 times?
2. When I programmed the 8600GT, seemingly innocuous nested for loops (about 3 deep) would compile but would not run, and gave no
meaningful error messages. I figured that the compiled code was out of the 8600GT’s parameter range, as Data of Star Trek used to say.
When I unrolled the outer loop, it worked. Would this code work without unrolling on a GTX280, which has higher specs than the 8600GT?

Thanx for the answers in advance.

J.

Delfistyaosani · January 15, 2009, 2:48pm

It’s a bit difficult to answer your question directly, but for example, I have a GTX260 (216 core version) and it is about 32 times faster than my C2D E8500 (GPU bruteforcer, 700M PPS vs 22M PPS).

Theoretically, if you get a GTX260 (about $270) and overclock it to 700MHz (runs pretty well if fan speed is at max, 3K RPM), that will be 8.75 times faster than the 8600GT and about 26 times faster than the 3200+.

I wouldn’t advise buying a GTX280: it has only 24 more cores (10%) and costs 20% more. If you really want an amazing system, buy three GTX260s (216 core versions). That costs about $780 and is about 75 times faster than the 3200+. If we multiply the 3200+’s price (about $60, is this right?) by 75, that would be $4500 :D
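
The figures above appear to follow from a simple cores-times-clock ratio. A short Python check, assuming a stock 8600 GT core clock of 540 MHz (a value not stated in the post) and the "about 3 times faster than the CPU" measurement from the first post:

```python
# Core counts and the 700 MHz overclock come from the thread;
# the 540 MHz stock 8600 GT core clock is an assumption.
cores_8600gt, clock_8600gt_mhz = 32, 540
cores_gtx260, clock_gtx260_mhz = 216, 700

gtx260_vs_8600gt = (cores_gtx260 * clock_gtx260_mhz) / (cores_8600gt * clock_8600gt_mhz)
gtx260_vs_cpu = gtx260_vs_8600gt * 3      # 8600 GT measured at about 3x the CPU
three_cards_vs_cpu = 3 * gtx260_vs_cpu    # assumes perfect scaling across three cards

print(gtx260_vs_8600gt)    # 8.75
print(gtx260_vs_cpu)       # 26.25
print(three_cards_vs_cpu)  # 78.75
```

This is a peak, compute-bound estimate only. Code limited by memory bandwidth, or by anything that does not split evenly across three cards, would likely see less.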

seibert · January 15, 2009, 5:11pm

Depending on whether your code is FPU limited or memory bandwidth limited (many CUDA apps are actually the latter), your code will scale something like the ratio of [processors*clock rate] or the ratio of the theoretical memory bandwidths. Compute both ratios, and that will give a range of speedups. (Note that other things might run faster on the GTX 280, like slightly uncoalesced memory reads. These ratios will not predict that.)
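
To make the two ratios concrete, here is a small Python sketch. The clock and bandwidth numbers are assumptions taken from published reference specs (shader clock, not core clock), and `speedup_range` is a hypothetical helper name; check the values against the actual cards before relying on them.

```python
# Spec-sheet values are assumptions from published reference specs; verify for your cards.
SPECS = {
    "8600 GT": {"processors": 32, "shader_clock_mhz": 1190, "bandwidth_gb_s": 22.4},
    "GTX 280": {"processors": 240, "shader_clock_mhz": 1296, "bandwidth_gb_s": 141.7},
}

def speedup_range(old, new):
    """Return (compute_ratio, bandwidth_ratio) between two cards (hypothetical helper)."""
    compute = (new["processors"] * new["shader_clock_mhz"]) / (
        old["processors"] * old["shader_clock_mhz"])
    bandwidth = new["bandwidth_gb_s"] / old["bandwidth_gb_s"]
    return compute, bandwidth

compute_ratio, bandwidth_ratio = speedup_range(SPECS["8600 GT"], SPECS["GTX 280"])
print(f"compute-bound estimate:   {compute_ratio:.1f}x")
print(f"bandwidth-bound estimate: {bandwidth_ratio:.1f}x")
```

With these assumed numbers the range comes out at roughly 6.3x (bandwidth bound) to 8.2x (compute bound), so "more than 10 times" looks optimistic unless the newer card also fixes some inefficiency such as uncoalesced reads.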

Without knowing why the nested for loop code failed, this is hard to say (though the answer is likely “no”). Did you get any error message at all?

_Big_Mac · January 15, 2009, 5:23pm

If your kernel is memory bound (as most are), GTX 280 has about 6-7 times the peak bandwidth of an 8600 GT and you could estimate your speed-up as this factor.

Delfistyaosani · January 15, 2009, 5:45pm

Can you tell how many operations are required on the same data? I ask this because there are many registers (1024 as far as I know) per core, so if you use PTX instead of CUDA, an optimized combination may speed up your code 2-3 times, because you need no extra clocks to access a register, but memory needs 100+. This is not controllable from CUDA; you’ll need to learn PTX and the CUDA Driver API (I’m doing that at this time :D )

Whatever your code depends on, FPU power or mem bandwidth, 3xGTX260 will be much faster than 2xGTX280 :)

E.D_Riedijk · January 15, 2009, 6:48pm

I have never heard of anyone that managed to get a significant speedup by using ptx instead of normal cuda code on the forum. In ptx registers are used only once, it is ptxas that performs optimization (and is quite good at it).

The actual decision to start to use local memory instead of registers is performed by ptxas, so ptx code has the same problem.

ptx is a nice intermediate format (OpenCL code will also be compiled to ptx). And people that are a fan of pascal can write pascal to ptx compilers this way ;)

Delfistyaosani · January 15, 2009, 7:32pm

PTX? But it is said that OpenCL will work on ATI cards too, and as far as I know PTX doesn’t. How about that? ^_^

AndreiB · January 15, 2009, 8:06pm

kjulius, if you plan to run your program on consumer graphics cards (i.e. not Tesla), you should avoid long kernel running times and try to fit within one second max. That’s because the watchdog on Vista is 2 seconds. To keep the workload reasonable for low-end and high-end cards, it may be necessary to adjust the number of blocks within a grid according to the number of processors on the installed card. If you follow those guidelines, your program will run on all cards, from the low-end 8600GT to the high-end GTX280.
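
One way to picture this guideline is the host-side planning sketch below, written in Python for brevity. The helper names, the throughput figures, and the "4 blocks per multiprocessor" starting point are all assumptions for illustration; in a real CUDA program the multiprocessor count would come from `cudaGetDeviceProperties` (the `multiProcessorCount` field) and the throughput from a timed trial run.

```python
import math

def plan_launches(total_paths, paths_per_second, max_kernel_seconds=1.0):
    """Split a workload into launches that each stay under a time budget (hypothetical helper).

    paths_per_second is a throughput you would measure on the installed card.
    """
    paths_per_launch = max(1, int(paths_per_second * max_kernel_seconds))
    n_launches = math.ceil(total_paths / paths_per_launch)
    return n_launches, paths_per_launch

def blocks_for_card(multiprocessors, blocks_per_multiprocessor=4):
    # Scale the grid with the card size; 4 blocks per multiprocessor is an assumed starting point.
    return multiprocessors * blocks_per_multiprocessor

# Assumed multiprocessor counts (4 on an 8600 GT, 30 on a GTX 280) and made-up throughputs.
for name, sm_count, throughput in [("8600 GT", 4, 2.0e6), ("GTX 280", 30, 1.5e7)]:
    launches, per_launch = plan_launches(10_000_000, throughput)
    print(name, "blocks:", blocks_for_card(sm_count),
          "launches:", launches, "paths per launch:", per_launch)
```

The slower card ends up with more, smaller launches, so each kernel call stays under the one-second target, while the faster card may finish the same job in a single launch.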

You will be surprised, but this is not controllable from PTX either. Register allocation happens after PTX generation, during ptxas phase, and I’m fairly sure it will rearrange things so that final result will be almost the same as with .cu files. No 2x-3x speedup ever.

Why don’t you just switch on your brain? How is CUDA code compiled?

High-level OpenCL code will be compiled into PTX for NVIDIA cards. For ATI it will be compiled into IL. The rest will be done by the driver. Or maybe you were dreaming about binary compatibility between all OpenCL-compliant cards?

Delfistyaosani · January 15, 2009, 10:38pm

Ok Ok, if it doesn’t matter then why are there .global and .reg definitions? If everything gets rearranged then there is no need for them :D I’ll test a few things about that and maybe do a small performance comparison to CUDA :)

tmurray · January 15, 2009, 11:10pm

final register allocation is done after PTX–otherwise, how could you handle cards with different numbers of registers?

Delfistyaosani · January 15, 2009, 11:38pm

Gas got it finally: defining .reg, .global or any other in PTX is like a “wish list”, if a user configuration doesn’t violate physical limits then OK, else some types get changed, right?

P.S.

your lucky post

kjulius · January 16, 2009, 3:30am

Thanks a lot for your response, gentlemen.

My program is currently targeting derivatives traders with commercial graphics cards who expect the results interactively in seconds
rather than minutes.

Following the advice given, I shouldn’t worry too much about getting the last drop of juice out of the GPU and should try to take
advantage of what I can get easily in affordable graphics cards.

About the nested for loop, I will try to come up with simplified code that demonstrates the problem.

kjulius