GTX260: arch improvements for CUDA?

Hi

I am a new member and I am trying to decide on a CUDA card for a server (i.e. no fancy graphics required).

Now, I see that the GTX260 is a new architecture, with more stream processors, double precision, etc. However, I want to run integer workloads. Do I gain more by buying two 8800GTs over a single 260?

What are the architectural improvements of the new 260 series?

Thanks
-D

It’s a huge improvement.

Register count is doubled (this is huge). Shared-memory atomics. Double precision support.
Much more versatile coalescing rules. Support for more threads in flight. Great memory bandwidth. Fused multiply-add.
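
To make the shared-atomics point concrete, here's a minimal sketch of the kind of thing they enable: a per-block histogram built entirely in shared memory. The kernel and bin count are made up for illustration; it needs compute 1.2+ hardware (e.g. GTX 260) and -arch=sm_12.

```
// Sketch: per-block histogram using shared-memory atomics (compute 1.2+).
#define NUM_BINS 64

__global__ void histogram(const unsigned int *data, unsigned int *hist, int n)
{
    __shared__ unsigned int localHist[NUM_BINS];

    // Zero the block-local histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    // Each thread bins its elements with a shared-memory atomic
    // (atomicAdd on shared memory is the compute 1.2 feature).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i] % NUM_BINS], 1u);
    __syncthreads();

    // Fold the block's result into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&hist[i], localHist[i]);
}
```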

It’s a big deal… the G200 architecture improvements were really tuned more for CUDA than graphics!

Thanks! Is there a study or review that demonstrates this? I'm hesitant to jump to the 260 because then I also have to upgrade my power supply, etc. Of course, if it's a big deal, I will …

Look at the CUDA 2.0 programming guide; it's explained under the "Improvements in Compute 1.2" section near the end. The GTX260 also supports Compute 1.3, which gets you double precision as well.
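
To give one concrete flavor of the coalescing change (this kernel is just an illustrative sketch, not from the guide): on compute 1.0/1.1, a misaligned read like the one below breaks into 16 separate transactions per half-warp, while compute 1.2 hardware coalesces it into roughly one or two segment transactions.

```
__global__ void copyShifted(const float *in, float *out, int n, int shift)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Reading at an offset misaligns the half-warp's addresses.
    // Compute 1.0/1.1: 16 separate transactions.
    // Compute 1.2+: the hardware issues the one or two memory
    // segments that cover the addresses.
    if (i + shift < n)
        out[i] = in[i + shift];
}
```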

Two 8800GTs may be faster, but there’s no SLI in CUDA. You’ll have to program them separately, which can be a bitch.
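
If you do go the two-card route, the usual pattern (circa CUDA 2.x) is one host thread per GPU, each calling cudaSetDevice() before doing any CUDA work, and you split the data yourself. Rough sketch, with a placeholder kernel and an even split purely for illustration:

```
// Sketch: one host thread per GPU. Kernel and data split are placeholders.
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;          // stand-in for the real workload
}

struct Job { int device; float *hostChunk; int n; };

void *runOnDevice(void *arg)
{
    Job *job = (Job *)arg;
    cudaSetDevice(job->device);       // bind this host thread to one GPU

    float *d;
    cudaMalloc(&d, job->n * sizeof(float));
    cudaMemcpy(d, job->hostChunk, job->n * sizeof(float), cudaMemcpyHostToDevice);
    work<<<(job->n + 255) / 256, 256>>>(d, job->n);
    cudaMemcpy(job->hostChunk, d, job->n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return NULL;
}

int main()
{
    const int N = 1 << 20;
    static float data[1 << 20];
    // Half the array per card.
    Job jobs[2] = { {0, data, N / 2}, {1, data + N / 2, N / 2} };

    pthread_t t[2];
    for (int g = 0; g < 2; ++g) pthread_create(&t[g], NULL, runOnDevice, &jobs[g]);
    for (int g = 0; g < 2; ++g) pthread_join(t[g], NULL);
    return 0;
}
```

Each host thread gets its own context, so device pointers allocated in one thread can't be used by the other; any combining of results happens on the host.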

Anyway, the difference between the GTX260 and the older generation isn't "huge." It's nice. It's critical if you need the new atomics features. But otherwise, if power consumption is what's guiding your decision, don't worry about it.

But really, are you putting this into production? If not, then who cares if you have the fastest card for development. If yes, then you should know what performance you need, and you should test to see what you get. And cooling and reliability would be big factors too.

Doubling the register count is huge for some of my algorithms. Before, I wasn't getting anywhere near peak memory bandwidth because of low occupancy.
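
For a rough back-of-envelope example (the 32-registers-per-thread kernel is hypothetical; the hardware limits are from the programming guide): with 256-thread blocks at 32 registers per thread, each block needs 8192 registers. On G80/G92 that's the entire 8K-register file, so only one block fits per multiprocessor, i.e. 256 of a possible 768 threads, about 33% occupancy. GT200's 16K-register file fits two such blocks, i.e. 512 of 1024 threads, 50% occupancy, which means far more memory requests in flight to hide latency.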

Sure, if you use few registers, always coalesce, and have no use for shared atomics, warp voting, or double precision, then the only difference will be the increased memory bandwidth.

It’s worth pointing out that even a 2x speedup is merely incremental when your algorithm has been sped up 50x as a result of being run on a GPU.

To put it more concretely: by that point your bottlenecks are probably elsewhere.

Anyway, hopefully a less power-hungry G200 will be released within 4-6 months. (NVIDIA is on an older process technology and badly needs a die shrink to compete with ATI.)