C1060 vs. GTX295

Which one has better performance?

The GTX295 has better performance because it has two GPUs. However, since they both share the same physical connection to the machine, their combined performance is a bit less than 2x that of a single GPU. Also, the GTX295 only has 896MB of RAM per GPU, and that memory can’t be shared between them (in case you were thinking you could use all of it for just one of the GPUs).

It really depends on what you’re going to be using CUDA for. If your application would normally require a lot of host->device->host->device->host (ad nauseam) transfers, you might be better off with the C1060, since its single GPU doesn’t have to share the connection to the host with a second GPU. If you’re running a lot of compute-intensive kernels on small amounts of data (or you’re just generating data on the device, e.g. random numbers), then you’re probably better off with the GTX295. You’ll also need to master multi-GPU programming in order to get full use out of the GTX295, which tends to be a bit more complicated (but not too bad) than single-GPU programming.
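To put a rough number on the transfer-versus-compute trade-off, here is a back-of-envelope sketch in Python. It is a simplification, not a measurement: it assumes the two GPUs split the compute evenly, that all host<->device copies are serialized over the one shared connection, and that copies never overlap with kernels. The function name and its parameter are made up for illustration.

```python
def dual_gpu_speedup(transfer_fraction):
    """Estimated speedup of two GPUs sharing one host link over a single GPU.

    transfer_fraction: share of the single-GPU runtime spent on
    host<->device copies (0.0 = pure compute, 1.0 = pure transfer).
    Assumption: compute time halves, transfer time stays the same.
    """
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    compute_fraction = 1.0 - transfer_fraction
    return 1.0 / (transfer_fraction + compute_fraction / 2.0)


if __name__ == "__main__":
    # Print the estimate for a few hypothetical workloads.
    for f in (0.0, 0.1, 0.25, 0.5):
        print(f"transfer share {f:.0%}: about {dual_gpu_speedup(f):.2f}x")
```

Under these assumptions, a workload that spends a quarter of its single-GPU runtime on transfers only gets about 1.6x from the second GPU, and the benefit shrinks quickly as the transfer share grows. Real numbers will differ, so treat this as a way to reason about your own profile rather than a prediction.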

Ask Google about their specs. She knows all.

Also, it depends on the program (whether it scales to two GPUs for example).

This is slightly OT, but if a single-GPU card is what you are after and larger memory is important, I noticed that Gainward have announced a number of interesting SKUs recently: a 2048MB GTS250, a 1792MB GTX260 and a 2048MB GTX285. As consumer video cards they are total overkill, but as CUDA compute devices they offer an interesting alternative to the 512MB, 896MB and 1024MB reference designs for memory-hungry applications.

Would those memories be clocked lower? I’ve seen some designs that had more memory but used DDR2 instead of DDR3 to compensate (the cost? power consumption? don’t know)

Actually, in our systems with 3 and even 4 GTX295s (i.e. 6 and 8 GPUs) we’ve noticed a degradation in performance of probably ~30% per GPU compared to running on just one half of a GTX295. This is mostly because of the way the motherboard and the PCI controller work, all sorts of hardware stuff I’m no expert at :)

In our tests we’ve also seen that one half of the GTX295 runs approximately the same as the C1060 and ~30% slower than the GTX280. Of course, “slower” is probably down to the difference in bandwidth between the cards.

The best way to prevent this degradation is to make the host->device->host transfers as few and as small as possible.

I’m now trying to move an almost serial (i.e. not parallel) part of the application to the GPU just to save those transfers; hopefully that will keep the GTX295 attractive to us.

As Big_Mac_ said, first look at the specs and then try to run some tests to see which is better for your application, and maybe you can change the application/algorithm/move code to the GPU in order to make your application more GTX295-friendly.

Also, mind you that as far as I know the C1060 (for production, and without the 50% discount nVidia announced) costs roughly twice the price of the GTX295, and you only get half the horsepower.

Still, nVidia strongly suggests that production environments should only use Tesla cards and not the GTXs.


You’re right…I heard it’s somewhere around 1.7x…“a bit” was a little vague.

I’m highly tempted to say GTX295. The 295 has a memory clock 25% faster than the C1060, so any memory-bound kernels will run faster on the 295. That, and the GTX295 has twice the number of cores.
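One caveat on the memory clock: peak bandwidth depends on bus width as well as clock, so it is worth working the number out. The sketch below does that in Python. The inputs are the figures as I remember them from the public spec sheets (448-bit bus at 999 MHz per GPU on the GTX295, 512-bit bus at 800 MHz on the C1060, both GDDR3 with two transfers per clock), so double-check them before relying on the result. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Spec values below are assumptions recalled from vendor spec sheets.
gtx295_per_gpu = peak_bandwidth_gb_s(448, 999)
c1060 = peak_bandwidth_gb_s(512, 800)

if __name__ == "__main__":
    print(f"GTX295 (per GPU): {gtx295_per_gpu:.1f} GB/s")
    print(f"C1060:            {c1060:.1f} GB/s")
    print(f"ratio:            {gtx295_per_gpu / c1060:.2f}")
```

With those inputs the per-GPU bandwidth advantage of the GTX295 comes out at roughly 9% rather than 25%, because the narrower bus eats part of the clock gain. Memory-bound kernels should still favor one half of the 295, just by less than the clock difference suggests.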

But as profquail said, it depends on what you want to do. If you need tons of memory, and splitting the data across several kernel calls adds a lot of overhead, the C1060 might be a better bet.
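To see how quickly the memory limit bites, here is a small Python sketch that estimates how many pieces a dataset has to be split into for a given card. The 896MB figure is from earlier in the thread; the 4GB for the C1060 is from its public spec sheet, not from this thread. The `usable_fraction` parameter and its 0.9 default are my own guess at how much memory is realistically left for data, and the function name is hypothetical.

```python
MIB = 1024 ** 2


def chunks_needed(dataset_bytes, device_memory_bytes, usable_fraction=0.9):
    """Number of pieces the dataset must be split into to fit on one GPU.

    usable_fraction is an assumed allowance for memory that is not
    available to the application's data (not a measured value).
    """
    usable = int(device_memory_bytes * usable_fraction)
    if usable <= 0:
        raise ValueError("no usable device memory")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-dataset_bytes // usable)


if __name__ == "__main__":
    dataset = 3072 * MIB  # a hypothetical 3 GiB working set
    print("GTX295 (per GPU):", chunks_needed(dataset, 896 * MIB))
    print("C1060:           ", chunks_needed(dataset, 4096 * MIB))
```

For that hypothetical 3 GiB working set, one GPU of the GTX295 needs 4 passes where the C1060 needs 1. Each extra pass means another round of transfers and kernel launches, which is the overhead being described here.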

If you’re just getting started with CUDA, then definitely go for the GTX295. It gives you the possibility to learn homogeneous multi-GPU programming, and half of it will almost always be faster than the C1060.