I am working on a Monte Carlo simulation for a financial problem. It is classified as an ‘embarrassingly parallel’ problem: all the thread
processors run the same code without communicating with each other, except at the last stage, when the results are added up.
My CUDA implementation, running on a GeForce 8600GT with 32 thread processors, is about 3 times faster than
on an AMD Athlon 64 3200+ (2 GHz) CPU.
If I use a GTX280, which has 240 thread processors with higher clock speeds and faster memory according to the spec sheet, how much faster
would it run than the 8600GT? More than 10 times?
When I programmed the 8600GT, seemingly innocuous nested for loops (about 3 deep) would compile but would not run, and
produced no meaningful error message. I figured that the compiled code was outside the 8600GT’s parameter range, as Data of Star Trek used to say.
When I unrolled the outer loop, it worked. Would this code work without unrolling on a GTX280, which has a higher spec than the 8600GT?
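For readers unfamiliar with the pattern described at the top of this post, here is a minimal CPU-side sketch in plain C++, not the poster's actual code. The contract parameters, the seed, and the function names `workerPartialSum` and `monteCarloPrice` are all hypothetical. Each "worker" stands in for one GPU thread: it simulates its own paths with its own random generator, and the only combining step is the final sum.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Work done by one "thread": simulate its own paths with its own generator,
// touching nothing shared, and return a partial sum of discounted payoffs.
double workerPartialSum(unsigned workerId, int pathsPerWorker) {
    // Hypothetical contract (an assumption for illustration): European call,
    // S0 = 100, K = 100, r = 5%, sigma = 20%, T = 1 year.
    const double s0 = 100.0, strike = 100.0, rate = 0.05, sigma = 0.2, years = 1.0;
    std::mt19937 rng(12345u + workerId);  // separate generator per worker
    std::normal_distribution<double> gauss(0.0, 1.0);
    const double drift = (rate - 0.5 * sigma * sigma) * years;
    const double vol = sigma * std::sqrt(years);
    const double discount = std::exp(-rate * years);
    double sum = 0.0;
    for (int i = 0; i < pathsPerWorker; ++i) {
        const double terminal = s0 * std::exp(drift + vol * gauss(rng));
        sum += discount * std::fmax(terminal - strike, 0.0);
    }
    return sum;
}

// The workers run one after another here; on a GPU each iteration of the
// first loop would be a separate thread. Only the last loop combines results.
double monteCarloPrice(int workers, int pathsPerWorker) {
    assert(workers > 0 && pathsPerWorker > 0);
    std::vector<double> partial(workers);
    for (int w = 0; w < workers; ++w) {
        partial[w] = workerPartialSum(static_cast<unsigned>(w), pathsPerWorker);
    }
    double total = 0.0;
    for (double p : partial) total += p;  // the single reduction step
    return total / (static_cast<double>(workers) * pathsPerWorker);
}
```

Calling `monteCarloPrice(32, 20000)` should give an estimate close to the Black-Scholes value of about 10.45 for these assumed parameters. Seeding each worker with consecutive integers is a simplification; production code would want properly independent random streams.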
It’s a bit difficult to answer your question directly, but for example, I have a GTX260 (216-core version) and it is about 32 times faster than my C2D E8500 (GPU brute-forcer, 700M PPS vs 22M PPS).
Theoretically, if you get a GTX260 (about $270) and overclock it to 700 MHz (it runs pretty well if the fan speed is at max, 3K RPM), it will be 8.75 times faster than the 8600GT and about 26 times faster than the 3200+.
I wouldn’t advise buying a GTX280: it has only 24 more cores (about 10%) and costs 20% more. If you really want an amazing system, buy three GTX260s (216-core versions). That costs about $780 and is about 75 times faster than the 3200+. If we multiply the 3200+’s price (about $60, is this right?) by 75, that comes to $4500 :D
Depending on whether your code is FPU limited or memory bandwidth limited (many CUDA apps are actually the latter), your code will scale something like the ratio of [processors*clock rate] or the ratio of the theoretical memory bandwidths. Compute both ratios, and that will give a range of speedups. (Note that other things might run faster on the GTX 280, like slightly uncoalesced memory reads. These ratios will not predict that.)
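The two ratios can be worked out with a few lines of C++. This is a rough sketch, not a benchmark: the struct and function names are made up for illustration, and the clock and bandwidth figures are assumed reference-board values (1190 MHz and 22.4 GB/s for the 8600GT, 1296 MHz and 141.7 GB/s for the GTX 280) that should be checked against the actual cards.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the spec-sheet numbers that matter here.
struct GpuSpec {
    int processors;          // thread (stream) processors
    double shaderClockMHz;   // clock the processors run at
    double memBandwidthGBs;  // theoretical memory bandwidth
};

struct SpeedupRange {
    double low;
    double high;
};

// Ratio of [processors * clock rate]: the estimate for FPU-limited code.
double computeRatio(const GpuSpec& from, const GpuSpec& to) {
    assert(from.processors > 0 && from.shaderClockMHz > 0.0);
    return (to.processors * to.shaderClockMHz) /
           (from.processors * from.shaderClockMHz);
}

// Ratio of theoretical bandwidths: the estimate for bandwidth-limited code.
double bandwidthRatio(const GpuSpec& from, const GpuSpec& to) {
    assert(from.memBandwidthGBs > 0.0);
    return to.memBandwidthGBs / from.memBandwidthGBs;
}

// The two ratios bracket the expected speedup.
SpeedupRange estimateSpeedup(const GpuSpec& from, const GpuSpec& to) {
    const double c = computeRatio(from, to);
    const double b = bandwidthRatio(from, to);
    return {std::min(c, b), std::max(c, b)};
}

// Assumed reference-board figures; verify them for your own hardware.
const GpuSpec kGeForce8600GT{32, 1190.0, 22.4};
const GpuSpec kGeForceGTX280{240, 1296.0, 141.7};
```

With these assumed figures, `estimateSpeedup(kGeForce8600GT, kGeForceGTX280)` gives a range of roughly 6.3x (bandwidth-limited) to 8.2x (FPU-limited). Effects such as the more forgiving memory coalescing mentioned above are not captured by either ratio.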
Without knowing why the nested for loop code failed, this is hard to say (though the answer is likely “no”). Did you get any error message at all?
Can you tell how many operations are required on the same data? I ask because there are many registers per core (1024, as far as I know), so if you use PTX instead of CUDA, an optimized combination may speed up your code 2-3 times: accessing a register costs no extra clocks, but memory needs 100+. This is not controllable from CUDA; you’ll need to learn PTX and the CUDA Driver API (I’m doing that at this time :D )
Whichever your code depends on, FPU power or memory bandwidth, 3xGTX260 will be much faster than 2xGTX280 :)
I have never heard of anyone on the forum who managed to get a significant speedup by using PTX instead of normal CUDA code. In PTX, registers are used only once; it is ptxas that performs the optimization (and it is quite good at it).
The actual decision to start using local memory instead of registers is made by ptxas, so PTX code has the same problem.
PTX is a nice intermediate format (OpenCL code will also be compiled to PTX). And people who are fans of Pascal can write Pascal-to-PTX compilers this way ;)
kjulius, if you plan to run your program on consumer graphics cards (i.e. not Tesla), you should avoid long kernel run times and try to keep each launch within one second at most. That’s because the watchdog timeout on Vista is 2 seconds. To keep the workload reasonable for both low-end and high-end cards, it may be necessary to adjust the number of blocks in a grid according to the number of processors on the installed card. If you follow those guidelines, your program will run on all cards, from the low-end 8600GT to the high-end GTX280.
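One way to follow this advice is to size the grid from the card's multiprocessor count and split the total work across several short launches. The host-side sketch below is plain C++ with hypothetical names (`LaunchPlan`, `planLaunches`). In a real CUDA program the multiprocessor count would come from `cudaGetDeviceProperties` (the `multiProcessorCount` field), and `pathsPerThread` is a tuning value you would have to pick by timing one launch on your slowest target card.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of how the total work is split into launches.
struct LaunchPlan {
    int blocksPerGrid;            // scales with the size of the card
    std::int64_t pathsPerLaunch;  // work done by a single kernel launch
    std::int64_t launches;        // number of launches needed in total
};

LaunchPlan planLaunches(std::int64_t totalPaths, int multiprocessorCount,
                        int blocksPerMultiprocessor, int threadsPerBlock,
                        int pathsPerThread) {
    assert(totalPaths > 0 && multiprocessorCount > 0);
    assert(blocksPerMultiprocessor > 0 && threadsPerBlock > 0 &&
           pathsPerThread > 0);
    LaunchPlan plan;
    // More multiprocessors means more blocks, so every card stays busy.
    plan.blocksPerGrid = multiprocessorCount * blocksPerMultiprocessor;
    // The work per thread is fixed, so the work per launch grows with the
    // card while the time per launch stays roughly comparable.
    plan.pathsPerLaunch = static_cast<std::int64_t>(plan.blocksPerGrid) *
                          threadsPerBlock * pathsPerThread;
    // Round up so the final, partially filled launch is counted.
    plan.launches =
        (totalPaths + plan.pathsPerLaunch - 1) / plan.pathsPerLaunch;
    return plan;
}
```

For example, with one million paths, 4 blocks per multiprocessor, 128 threads per block and 100 paths per thread, an 8600GT (4 multiprocessors) needs 5 launches of 16 blocks each, while a GTX280 (30 multiprocessors) covers everything in a single launch of 120 blocks.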
You will be surprised, but this is not controllable from PTX either. Register allocation happens after PTX generation, during the ptxas phase, and I’m fairly sure it will rearrange things so that the final result is almost the same as with .cu files. No 2x-3x speedup, ever.
Why not just switch on your brain? How is CUDA code compiled?
High-level OpenCL code will be compiled into PTX for NVIDIA cards. For ATI it will be compiled into IL. The rest will be done by the driver. Or maybe you were dreaming about binary compatibility between all OpenCL-compliant cards?
OK, OK, if it doesn’t matter, then why are there .global and .reg definitions? If everything gets rearranged, there is no need for them :D I’ll test a few things about that, and maybe do a small performance comparison against CUDA :)