It is no problem if the x16 slot is only wired as x8 electrically; it will just reduce your host-device bandwidth.
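If you want to see what an x8 link actually costs you, a quick way is to time a large `cudaMemcpy` from pinned memory, similar to the SDK's bandwidthTest sample. A minimal sketch (error checking omitted, the 64 MB transfer size is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;       // 64 MB transfer
    void *h, *d;
    cudaMallocHost(&h, bytes);           // pinned host memory, needed for full bus speed
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / ms / 1e6 == GB transferred per second
    printf("host->device: %.2f GB/s\n", bytes / ms / 1e6);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

On a healthy x16 gen2 link you'd expect somewhere in the 5-6 GB/s range with pinned memory; roughly half that on x8.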
All 3 cards mentioned are basically the same chip. The C1060 is just missing the display ports and instead has a bunch more memory. The GTX 285 is the same chip again, just shrunk with a newer manufacturing process (meaning less power consumption/heat at the same speed, and more overclocking headroom); apart from slightly higher clock rates, everything else is the same as in the GTX 280. The GTX 295 is two cards somewhere in between a 260 and a 280, but produced with the 285's manufacturing process and "glued" together, so both cards share the same slot.
All of them support double precision.
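If you want to verify that at runtime: double precision needs compute capability 1.3, which all of these GT200-class parts have, while the older 8/9-series cards report 1.0-1.1. A small check using the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Compute capability 1.3 or higher means native double precision support.
        bool dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("device %d: %s (sm_%d%d) double precision: %s\n",
               i, prop.name, prop.major, prop.minor, dp ? "yes" : "no");
    }
    return 0;
}
```

Note a GTX 295 will show up as two separate devices here, one per GPU.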
Another thing to keep in mind is that the two GPUs in the gtx295 share the same PCI-Express connector. This means that whatever the bandwidth of your bus is, each GPU will only get half if you are accessing them at the same time (as you are likely to be doing if you’ve split your workload over both cards).
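Since a CUDA context is bound to one host thread, splitting work over both halves of a 295 means one host thread per GPU, each calling `cudaSetDevice` before touching the device. A rough pthreads sketch (the kernel and sizes are placeholders, error checking omitted):

```cuda
#include <pthread.h>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;              // placeholder workload
}

static void *worker(void *arg) {
    int dev = *(int *)arg;
    cudaSetDevice(dev);                      // bind this host thread to one GPU
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();                 // any copies here share one PCIe link on a 295
    cudaFree(d);
    return NULL;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("need two GPUs\n"); return 1; }
    int ids[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    return 0;
}
```

The kernels run at full speed on each half; it is only the host-device transfers that contend for the shared connector.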
If you don’t need a graphics card to show “nice” 3D graphics, you can just go with the cheapest 8- or 9-series card you can find; e.g., a low-budget 8-series card only uses around 40 W, and a 9400 is at 50 W.
You should also be able to find at least some 8-series cards in a PCI (not PCIe) version, if you don’t want to occupy a PCIe slot.
Of course, if you don’t need the 4 GB of memory in a C1060, the easiest solution is to install a single GTX 280 (err 285 I guess now) instead. It is the same size, uses the same power connectors, and even runs a little faster. :)
True, but more memory bandwidth can’t hurt either (the GTX 280’s memory is clocked higher than the Tesla’s), so you’ll have to pick. :)
To first order, the memory you use in CUDA will be similar to a CPU version of your code. As you make your algorithm more CUDA-friendly, you tend to use less memory, not more, since many CUDA programs are memory bandwidth limited, rather than FLOPS limited.
Don’t get me wrong: I’m not saying you shouldn’t purchase a Tesla C1060, but you should be aware that the Tesla compute devices are not unequivocally the best solution for all CUDA tasks. There are tradeoffs.
Another option instead of the 285 would be to look at the discounted GTX 280s:
a) get a set of workstations with power supplies strong enough to run either of these cards for CUDA use (obviously the GTX has video output, so that is good);
b) get a set of servers for our clusters that can likewise accommodate full-length, double-height cards and supply enough power for them;
c) they need to be double-precision cards.
The S1070 is self-contained, so we are okay with that, but it only serves two servers, and it is expensive to roll out on 20+ machines.
It is not about the computational or graphics “power” of the cards; unfortunately, it is about real power draw and the PCI form factor. We would be happy with a small starter card, but all the double-precision cards are high end.
Right now we are just considering large form-factor workstations and supplementing the power supply with a modular unit like:
The lowest-power G200 card will be, according to Wikipedia, a new 55nm GTX 260 Core 216 that’s being released now. (There is no new name for it.)
That, I think, is the best solution. For example, you don’t have to worry about sizing up the PSU, which is not easy: besides estimating the power consumption of all the other components, you have to check how the load is split among the PSU’s rails. It might also be more reliable.