I am thinking of using CUDA/GPU and F# to (massively) improve some bioinformatics algorithms I have.
Can anyone suggest a good setup (motherboard, RAM, GPU, etc.) at a relatively low cost? I don’t need maximum performance just yet, just a proof of concept that such an approach will work.
There is a very long thread extolling the virtues of new and inexpensive GPU cards based on the just-released GK208 chip.
A GK208 with DDR3 memory is close to being the least expensive new card you can buy ($65?), and it’s one of the few (TITAN, TESLA-K, GK208) that support the new “sm_35” capabilities like Dynamic Parallelism and 255 registers per thread.
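For context, Dynamic Parallelism just means a kernel can launch other kernels from the device. A minimal, untested illustration of my own (names are arbitrary; build with nvcc -arch=sm_35 -rdc=true and link against cudadevrt):

```
#include <cuda_runtime.h>

__global__ void child(int *data)
{
    data[threadIdx.x] += 1;
}

__global__ void parent(int *data)
{
    // A single thread launches more work from the device, no host round-trip.
    if (threadIdx.x == 0)
        child<<<1, 32>>>(data);
}

int main()
{
    int *d_data;
    cudaMalloc(&d_data, 32 * sizeof(int));
    cudaMemset(d_data, 0, 32 * sizeof(int));
    parent<<<1, 1>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```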
The inexpensive GK208 cards have 2 SMX’s (384 cores) vs. 14 (2688 cores) on the TITAN.
For bioinformatics you may be interested in the SIMD-within-a-word functions (available from the registered developer website), which have direct hardware support on sm_3x platforms.
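As a rough idea of what those look like inside a kernel, here is a minimal, untested sketch (kernel only, my own names) assuming the __vsadu4 intrinsic, which computes a per-byte sum of absolute differences within a 32-bit word, is available from that header or your toolkit:

```
#include <cuda_runtime.h>

// Per-word sum of absolute differences over bytes packed 4-per-word,
// a useful shape for byte-wise sequence comparisons.
__global__ void packed_sad(const unsigned int *a, const unsigned int *b,
                           unsigned int *out, int nWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nWords) {
        // __vsadu4 treats each 32-bit word as 4 unsigned bytes and returns
        // |a0-b0| + |a1-b1| + |a2-b2| + |a3-b3| in a single instruction on sm_3x.
        out[i] = __vsadu4(a[i], b[i]);
    }
}
```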
Please be aware that there is a significant performance difference between low-end and high-end GPUs, both in terms of memory bandwidth and core processing capability.
If you want to learn how to do CUDA programming on a budget, stuffing a GK208 card into a spare computer is great bang-for-the-buck.
However, if you want to benchmark algorithms on a GPU, it is very important to use realistic-sized problems on a realistic-sized GPU. Things will not scale linearly over a large range of problem sizes or GPU sizes. This often leads people to conclude that CUDA is useless because they timed a trivial problem or used a very low-end card.
For that reason, if your budget allows, I would suggest purchasing either a GTX 780 ($650) or a Titan ($1k). If you need double precision or 6 GB of GPU memory, the Titan is the better choice; otherwise the GTX 780 is nearly as fast. If you really need to keep the price down, then a GTX 770 is as low as I would go, although it is a slightly older architecture (compute capability 3.0), which is why I would tend to avoid it for new purchases.
Aside from that, you probably want about 2x as much CPU memory as GPU memory (roughly; it depends on the exact problem), a PCI-Express 3.0 motherboard, and a separate, cheap GPU to run the display.
Thanks, that is exactly what I was after. No point cheaping out; to prove my concept I will still need the big boys. So should I go straight up to, say, a top-end Tesla?
FYI, the problems are currently using older technologies and taking upwards of 30 hours on 16 CPUs of a 1.3 GHz shared-memory machine (molecular modelling).
Well, no need to go crazy now. :) The difference between the Tesla K20X and the Titan (aside from $3k) is not as big as you might think:
For non-double precision stuff, Titan has 12% higher instruction throughput and memory bandwidth than the K20X.
Titan’s double precision is controlled by a driver configuration flag. By default, double precision is 1/24 the throughput of single precision. However, the full double precision performance (1/3 of single) can be enabled by the driver. The tradeoff is that when full double precision performance is enabled, the GPU clock is more limited, slowing down non-double precision instructions.
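A quick way to see which mode you are actually in is to time a kernel dominated by double-precision FMAs before and after toggling the driver setting. A rough, untested sketch of my own (sizes and constants are arbitrary):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp_fma(double *out, int iters)
{
    double x = 1.0 + 1e-9 * threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = fma(x, 1.000000001, 1e-9);   // dependent double-precision FMAs
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int blocks = 512, threads = 256, iters = 100000;
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dp_fma<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Run once in each mode; the 1/24 vs. 1/3 difference should be obvious.
    printf("DP FMA kernel took %.1f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```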
Tesla has two DMA engines, so it can overlap bidirectional transfers over the PCI-Express bus with kernel execution. Titan only has 1 DMA engine, so data can only transfer in one direction at a time.
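The overlap in question is the usual pattern of async copies in separate streams with pinned host memory, roughly like this (untested sketch, illustrative names, error checking omitted):

```
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out, *d_work;
    cudaMallocHost(&h_in, bytes);   // async copies need pinned host memory
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_work, bytes);

    cudaStream_t up, down, compute;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);
    cudaStreamCreate(&compute);

    // With two DMA engines (Tesla) the H2D and D2H copies below can run at the
    // same time as each other and as the kernel; with one engine (Titan) the
    // two copies are serialized even though they sit in different streams.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);
    scale<<<(n + 255) / 256, 256, 0, compute>>>(d_work, n);

    cudaDeviceSynchronize();

    cudaFreeHost(h_in);  cudaFreeHost(h_out);
    cudaFree(d_in);      cudaFree(d_out);      cudaFree(d_work);
    return 0;
}
```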
Tesla has the option of turning on ECC for device memory, at the cost of some memory bandwidth. Titan has no ECC.
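If you are not sure what a given card is currently set to, the runtime reports it; a quick, untested check using the ECCEnabled field of cudaDeviceProp:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d (%s): ECC %s\n", d, prop.name,
               prop.ECCEnabled ? "enabled" : "disabled");
    }
    return 0;
}
```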
Tesla can use special “TCC” drivers on Windows, which have less overhead than the normal Windows drivers. On Linux, this makes little difference.
Tesla has some other more obscure features, like RDMA (copy data directly to/from certain devices without CPU intervention, like Infiniband cards), and special support for using MPI programs with HyperQ.
Basically, NVIDIA seems to be advocating a model of: “Develop on Titan, deploy on Tesla.” This makes a lot of sense, especially now that the full double precision performance can be unlocked on Titan if you need it.
Nice summary, seibert. On that note I wanted to add, with regard to the simpleHyperQ example: the limitation is that Tesla K20/K20X can run up to 32 concurrent kernels, while Titan (and GK208, which I tested a few days ago) can only run 8 concurrent kernels.
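For anyone who wants to reproduce that on their own card, the gist of the test is just a long-running dummy kernel launched once per stream. A rough, untested sketch of mine along the lines of the simpleHyperQ sample:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }   // busy-wait for roughly 'cycles' clocks
}

int main()
{
    const int nStreams = 32;   // K20/K20X top out at 32 concurrent kernels; Titan/GK208 at 8
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 1, 0, streams[i]>>>(10LL * 1000 * 1000);   // ~10M clocks each
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // If all kernels overlap, the total is close to one kernel's runtime;
    // the more they serialize, the larger the total gets.
    printf("%d single-block kernels took %.1f ms total\n", nStreams, ms);

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```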
One question on this:
If I have 4 concurrent (and equal) transfers:
gpu 0 > gpu 1
gpu 1 > gpu 2
gpu 2 > gpu 3
gpu 3 > gpu 0
will the two DMA engines double the throughput of these transfers when compared to a Titan?
With 4 cards transmitting in a ring like this, you have to worry about how the motherboard routes PCI-Express to figure out the total available bandwidth. All the single-socket motherboards I’m aware of have fewer than 36 lanes of PCI-E, so when they support full-rate data transfers on 4 slots, they do it with PCI-Express switches.
For example, if gpu 0 and 1 share a switch and gpu 2 and 3 share a switch, then in principle I think you could double your throughput with K20 cards. Whether that happens in practice depends on the behavior of the PCI-E switches, the motherboard chipset, and maybe some obscure BIOS stuff. I wouldn’t count on it without testing with the exact motherboard and CPU you plan to deploy. :)
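If you do get around to testing it, the measurement itself is simple. Here is a rough, untested sketch of the 4-way ring using peer copies (error checking omitted; assumes 4 devices and that peer access can be enabled between ring neighbours; the buffer contents don't matter since only the transfer time is of interest):

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int nGpus = 4;
    const size_t bytes = 256u << 20;          // 256 MB per transfer
    float *buf[nGpus];
    cudaStream_t stream[nGpus];

    for (int d = 0; d < nGpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], bytes);
        cudaStreamCreate(&stream[d]);
        cudaDeviceEnablePeerAccess((d + 1) % nGpus, 0);   // ring neighbour, if supported
    }

    auto t0 = std::chrono::high_resolution_clock::now();

    // Issue all four copies up front so they can proceed concurrently:
    // gpu d -> gpu (d + 1) % 4, each in its own stream.
    for (int d = 0; d < nGpus; ++d) {
        int dst = (d + 1) % nGpus;
        cudaSetDevice(d);
        cudaMemcpyPeerAsync(buf[dst], dst, buf[d], d, bytes, stream[d]);
    }
    for (int d = 0; d < nGpus; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("4-way ring moved %.0f MB total in %.1f ms\n",
           nGpus * (bytes / 1048576.0), ms);
    return 0;
}
```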