Recommended setup for trial GPU computing

Hi All,

I am thinking of using CUDA/GPU and F# to (massively) improve some bioinformatics algorithms I have.

Can anyone suggest a good setup (motherboard, RAM, GPU, etc.) at a relatively low cost? I don’t need maximum performance just yet, only a proof of concept that such an approach will work.


There is a very long thread extolling the virtues of new and inexpensive GPU cards based on the just-released GK208 chip.

A GK208 with DDR3 memory is close to being the least expensive new card you can buy ($65?), and it’s one of the few (TITAN, TESLA-K, GK208) that support the new “sm_35” capabilities like Dynamic Parallelism and 255 registers per thread.

The inexpensive GK208 cards have 2 SMX’s (384 cores) vs. 14 (2688 cores) on the TITAN.
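The core counts above follow from the SMX count. A minimal sketch of that arithmetic, assuming the Kepler figure of 192 CUDA cores per SMX (the function name is just for illustration):

```python
# Assumption: each Kepler SMX contains 192 CUDA cores.
CORES_PER_SMX = 192

def cuda_cores(smx_count: int) -> int:
    """Total CUDA cores for a Kepler GPU with the given number of SMX units."""
    return smx_count * CORES_PER_SMX

gk208_cores = cuda_cores(2)   # inexpensive GK208 card
titan_cores = cuda_cores(14)  # TITAN

# Print both totals and how many times more cores the TITAN has.
print(gk208_cores, titan_cores, titan_cores / gk208_cores)
```

So on core count alone the TITAN is about 7x the GK208, before clock speed and memory bandwidth are considered.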

NVIDIA also notes this same chip can be found in laptops.

For bioinformatics you may be interested in the SIMD-within-a-word functions (available from the registered developer website), which have direct hardware support on sm_3x platforms.

Please be aware that there is a significant performance difference between low-end and high-end GPUs, both in terms of memory bandwidth and core processing capability.

Good point by @njuffa. You said “cheap” but that comes with a fraction of the bandwidth of a TITAN/TESLA.

For a GT630 with a GK208+DDR3 it’s about 14.4 GB/sec vs. a massive 288 GB/sec for a TITAN. That’s a 20x difference.
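For anyone wanting to check those numbers, peak memory bandwidth is roughly bus width times effective data rate. The sketch below is only an illustration: the bus widths (64-bit and 384-bit) and effective data rates (1800 and 6000 MT/s) are my assumptions, chosen because they reproduce the figures quoted above; check the spec sheet of the exact card you buy.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_mt_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bytes per transfer = bus width / 8
    GB/s = bytes per transfer * mega-transfers per second / 1000
    """
    return (bus_width_bits / 8) * effective_mt_s / 1000.0

# Assumed specs (not from the thread): 64-bit DDR3 at 1800 MT/s effective.
gt630 = peak_bandwidth_gb_s(64, 1800)
# Assumed specs (not from the thread): 384-bit GDDR5 at about 6000 MT/s effective.
titan = peak_bandwidth_gb_s(384, 6000)

print(gt630, titan, titan / gt630)
```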

If you want to learn how to do CUDA programming on a budget, stuffing a GK208 card into a spare computer is great bang-for-the-buck.

However, if you want to benchmark algorithms on a GPU, it is very important to use realistic-sized problems on a realistic-sized GPU. Things will not scale linearly over a large range of problem sizes or GPU sizes. This often leads people to conclude that CUDA is useless because they timed a trivial problem or used a very low-end card.
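To illustrate why trivial problems are misleading, here is a toy timing model. Every number in it is hypothetical (a fixed 1 ms overhead for launch and transfer setup, made-up CPU and GPU processing rates); it is a sketch of the shape of the effect, not a measurement of any real card.

```python
# All constants below are hypothetical, for illustration only.
FIXED_GPU_OVERHEAD_S = 1e-3   # launch + transfer setup cost per run
CPU_ITEMS_PER_S = 1e8         # assumed CPU processing rate
GPU_ITEMS_PER_S = 2e9         # assumed GPU processing rate

def cpu_time(n_items: float) -> float:
    return n_items / CPU_ITEMS_PER_S

def gpu_time(n_items: float) -> float:
    # The fixed overhead is paid regardless of problem size.
    return FIXED_GPU_OVERHEAD_S + n_items / GPU_ITEMS_PER_S

def speedup(n_items: float) -> float:
    return cpu_time(n_items) / gpu_time(n_items)

for n in (1e4, 1e6, 1e9):
    print(f"n={n:.0e}  speedup={speedup(n):.2f}")
```

In this model the small problem runs slower on the GPU than on the CPU, while the large one approaches the ratio of the two processing rates. That is the trap: timing only the small case would suggest the GPU is useless.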

For that reason, if your budget allows, I would suggest purchasing either a GTX 780 ($650) or a Titan ($1k). If you need double precision or 6 GB of GPU memory, the Titan is the better choice; otherwise the GTX 780 is nearly as fast. If you really need to keep the price down, then a GTX 770 is as low as I would go. The GTX 770 uses a slightly older architecture (compute capability 3.0), which is why I would tend to avoid it for new purchases.

Aside from that, you probably want about 2x as much CPU memory as GPU memory (roughly; it depends on the exact problem), a PCI-Express 3.0 motherboard, and a separate, cheap GPU to run the display.

Thanks. That is exactly what I was after. No point cheaping out. To prove my concept I will still need the big boys. So should I go straight to, say, a top-end Tesla?

FYI, the problems currently use older technologies and take upwards of 30 hours on 16 CPUs of a 1.3 GHz shared-memory machine (molecular modelling).

Well, no need to go crazy now. :) The difference between the Tesla K20X and the Titan (aside from $3k) is not as big as you might think:

  • For non-double precision stuff, Titan has 12% higher instruction throughput and memory bandwidth than the K20X.

  • Titan’s double precision is controlled by a driver configuration flag. By default, double precision is 1/24 the throughput of single precision. However, the full double precision performance (1/3 of single) can be enabled by the driver. The tradeoff is that when full double precision performance is enabled, the GPU clock is more limited, slowing down non-double precision instructions.

  • Tesla has two DMA engines, so it can overlap bidirectional transfers over the PCI-Express bus with kernel execution. Titan only has 1 DMA engine, so data can only transfer in one direction at a time.

  • Tesla has the option of turning on ECC for device memory, at the cost of some memory bandwidth. Titan has no ECC.

  • Tesla can use special “TCC” drivers on Windows, which have less overhead than the normal Windows drivers. On Linux, this makes little difference.

  • Tesla has some other more obscure features, like RDMA (copy data directly to/from certain devices without CPU intervention, like Infiniband cards), and special support for using MPI programs with HyperQ.
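To put the double precision bullet in numbers, here is a rough sketch of peak throughput in the two driver modes. The 837 MHz clock and the "2 FLOPs per core per cycle" (one fused multiply-add) factor are my assumptions, and the sketch ignores the clock reduction mentioned above when full double precision is enabled, so treat the results as upper bounds.

```python
TITAN_CORES = 2688          # from the core count quoted earlier in the thread
ASSUMED_CLOCK_GHZ = 0.837   # assumption: base clock, not stated in the thread
FLOPS_PER_CORE_CYCLE = 2    # assumption: one fused multiply-add per cycle

def peak_sp_gflops(cores: int, clock_ghz: float) -> float:
    """Theoretical single precision peak in GFLOPS."""
    return cores * FLOPS_PER_CORE_CYCLE * clock_ghz

def peak_dp_gflops(sp_gflops: float, dp_ratio_denominator: int) -> float:
    """Double precision peak when DP runs at 1/denominator of the SP rate."""
    return sp_gflops / dp_ratio_denominator

sp = peak_sp_gflops(TITAN_CORES, ASSUMED_CLOCK_GHZ)
dp_default = peak_dp_gflops(sp, 24)  # default driver setting: 1/24 of SP
dp_full = peak_dp_gflops(sp, 3)      # full DP enabled: 1/3 of SP

print(round(sp), round(dp_default), round(dp_full), dp_full / dp_default)
```

Whatever the exact clock, the ratio between the two modes is 24/3 = 8x, which is why the flag matters so much for double precision codes.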

Basically, NVIDIA seems to be advocating a model of: “Develop on Titan, deploy on Tesla.” This makes a lot of sense, especially now that the full double precision performance can be unlocked on Titan if you need it.

Nice summary seibert. On that note, I wanted to add – in regards to the simpleHyperQ example, the limitation is that Tesla K20/K20X can do up to 32 concurrent kernels. Titan (and GK208, tested that a few days ago) can only do 8 concurrent kernels.
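A quick way to picture that limit: if you launch N independent kernels of equal duration, they run in "waves" of at most the concurrency limit. This is an idealized sketch (it assumes every kernel is small enough to share the GPU and that nothing else serializes them), using the 32 and 8 figures from this post.

```python
import math

def kernel_waves(num_kernels: int, max_concurrent: int) -> int:
    """Idealized number of sequential waves needed to run all kernels."""
    return math.ceil(num_kernels / max_concurrent)

N = 32  # hypothetical number of independent kernels launched in streams
print("Tesla K20/K20X:", kernel_waves(N, 32), "wave(s)")
print("Titan / GK208: ", kernel_waves(N, 8), "wave(s)")
```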

One question on this:
If I have 4 concurrent (and equal) transfers
gpu 0 > gpu 1
gpu 1 > gpu 2
gpu 2 > gpu 3
gpu 3 > gpu 0
will the two DMA engines double the throughput of these transfers when compared to a Titan?

With 4 cards transmitting in a ring like this, you have to worry about how the motherboard routes PCI-Express to figure out the total available bandwidth. All the single-socket motherboards I’m aware of have fewer than 36 lanes of PCI-E, so when they support full-rate data transfers on 4 slots, they do it with PCI-Express switches.

For example, if gpu 0 and 1 share a switch and gpu 2 and 3 share a switch, then in principle I think you could double your throughput with K20 cards. Whether that happens in practice depends on the behavior of the PCI-E switches, the motherboard chipset, and maybe some obscure BIOS stuff. I wouldn’t count on it without testing with the exact motherboard and CPU you plan to deploy. :)
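As a back-of-the-envelope version of the "in principle" case: in the ring, every GPU both sends one buffer and receives one buffer. The sketch below takes the thread's premise that a card with one DMA engine moves data in only one direction at a time, and it ignores PCI-E switch and chipset contention entirely. The buffer size and link bandwidth are hypothetical placeholders.

```python
def ring_step_time_s(buffer_gb: float, link_gb_s: float, dma_engines: int) -> float:
    """Idealized time for one ring step where each GPU sends and receives
    one buffer. With 2 engines the send and receive overlap; with 1 they
    are assumed to run back to back."""
    one_way = buffer_gb / link_gb_s
    if dma_engines >= 2:
        return one_way
    return 2 * one_way

BUFFER_GB = 1.0   # hypothetical transfer size
LINK_GB_S = 10.0  # hypothetical sustained per-direction PCI-E bandwidth

t_titan = ring_step_time_s(BUFFER_GB, LINK_GB_S, dma_engines=1)
t_tesla = ring_step_time_s(BUFFER_GB, LINK_GB_S, dma_engines=2)
print(t_titan, t_tesla, t_titan / t_tesla)
```

The 2x here is the best case under those assumptions; as noted above, the real number depends on the switches and motherboard, so measure on the actual hardware.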