Recommended graphics card (OpenACC)

I have some experience in parallel programming (MPI and OpenMP), and I am planning to dive into GPU programming as well. I do scientific computing, mainly stuff with double precision floating point operations. I am especially attracted by OpenACC (but I will likely explore CUDA Fortran as well). The point is that I have no experience with graphics cards, not even for gaming, so I really need some help.

My present platform is: Asus Z170 Deluxe + Intel i7-6700 @ 3.40GHz (Skylake) / 32 GB RAM.
I only use GNU/Linux (at the moment, Linux Mint 18.1).

I do not plan to become a gamer or to do video editing, so I would use the graphics card basically only for scientific computing (for the rest of my activities, the integrated Intel graphics that I am using now would be just fine).

So my question is: What would you recommend to buy? If it makes sense, I would try not to go beyond ~300 euro, more or less.


For that price, fast 64-bit (double precision) throughput will be hard to find in a new GPU. Are you sure that you need 64-bit? 32-bit floating point computation in CUDA is very accurate thanks to device-supported 32-bit FMA operations.

Also OpenACC does not have good reviews, as it is slow and the performance is nowhere near what you would get in CUDA.

For 64-bit the Tesla line or an older Titan Black would be your only options, but still way over your price limit.

If you could get away with 32-bit for some of your computations then the GTX 1070 would probably be the best choice.

I am aware of the fact that OpenACC is not as fast as CUDA, but my codes should be portable also to other platforms.

Working in single precision is certainly possible in many cases but… Excuse me for the naive question: what if I work in double precision with the GTX 1070? Is it better not to even think about it, or is it still feasible, when necessary?
Thank you.

If the application (and the algorithms used) map well to a parallel architecture, then even if the 64-bit computation is slow, the better memory bandwidth and the concurrent computational capabilities may still improve performance.

The GTX 1070 can do 64-bit computation, but at nowhere near the speed of 32-bit.

Often, for applications such as particle simulation, most of the particle movement tracking is done in 32-bit and only the energy accumulation needs the 64-bit capability. In this case even a GTX 1070 will give you superior performance over a typical CPU. But again, that is true using CUDA, while OpenACC is a far less desirable option.
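
To illustrate why the accumulation step is the one that tends to need 64-bit, here is a minimal CPU-side sketch in Python. It is not GPU code; it only emulates single-precision rounding with the standard `struct` module. The helper names (`to_f32`, `sum_f32`, `sum_f64`) and the "energies" data are made up for this illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate in single precision: round after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def sum_f64(values):
    """Accumulate the same single-precision inputs in double precision."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


# Hypothetical per-particle energies, each stored in single precision.
energies = [to_f32(0.1)] * 1_000_000
reference = to_f32(0.1) * 1_000_000

err32 = abs(sum_f32(energies) - reference)
err64 = abs(sum_f64(energies) - reference)
print(f"float32 accumulator error: {err32:.3e}")
print(f"float64 accumulator error: {err64:.3e}")
```

The inputs are identical in both runs; only the accumulator width differs. That is the pattern described above: keep the bulk data in 32-bit and spend 64-bit only on the running sum.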

I am not sure of pricing in Europe, but I don’t think a GTX 1070 would fit a 300 Euro budget (European countries tend to have a pretty steep VAT rate). A GTX 1060 might. That would give you a theoretical performance of about 110 GFLOPS (double precision). The i7-6700 @ 3.40GHz has pretty much the same theoretical floating-point throughput, about 110 GFLOPS double precision.

I don’t know how much efficiency OpenACC can provide, but I think it is likely that you will get closer to these theoretical GFLOPS rates on the GPU than on the CPU, partially because a GTX 1060 gives you about 5x the memory bandwidth of an i7-6700. So for learning about GPUs and OpenACC, such a configuration would be fine, just don’t expect any massive speedups when you move double-precision code from the CPU to the GPU.

If affordable double precision is your hard requirement, look at old used Quadro cards from the Fermi and Kepler generations. They have “unlocked” double precision at 1/2 (for Fermi) and 1/3 (for Kepler) of the FP32 rate. For example, a used Fermi Quadro 6000 is only 250 Euro on Ebay France, but gives a very powerful 515 GFlops of double precision performance. Note how fast this is even compared to the very fastest consumer GPU, the GTX 1080 Ti (released today!), which has only 332 GFlops of double precision throughput. Fermi is really old (and even becoming deprecated in CUDA 9+), and doesn’t have some important modern features (no FMA, no SHFL) but it’ll be hard to beat its DP-throughput-per-euro. Kepler Quadros are more modern options but with a lower DP performance ratio.
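
For anyone who wants to see where numbers like these come from, here is a rough sketch of the usual back-of-the-envelope formula: cores × clock × 2 flops (one FMA per cycle) × the DP:SP ratio. The core counts, clocks, and the 1/32 ratio for the GTX 1080 Ti are spec values I am assuming, not figures from this thread, and the function names are just for illustration.

```python
def peak_sp_gflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical single-precision peak, counting one FMA as 2 flops."""
    return cores * clock_ghz * flops_per_core_per_cycle


def peak_dp_gflops(cores, clock_ghz, dp_ratio):
    """Theoretical double-precision peak as a fraction of the SP peak."""
    return peak_sp_gflops(cores, clock_ghz) * dp_ratio


# Assumed specs (not from this thread): 448 cores at a 1.148 GHz shader clock,
# DP at 1/2 of the SP rate.
quadro_6000 = peak_dp_gflops(cores=448, clock_ghz=1.148, dp_ratio=1 / 2)

# Assumed specs (not from this thread): 3584 cores at a 1.480 GHz base clock,
# DP at 1/32 of the SP rate.
gtx_1080_ti = peak_dp_gflops(cores=3584, clock_ghz=1.480, dp_ratio=1 / 32)

print(f"Quadro 6000 (Fermi): {quadro_6000:.1f} DP GFLOPS")
print(f"GTX 1080 Ti:         {gtx_1080_ti:.1f} DP GFLOPS")
```

With those assumed inputs the formula gives roughly 514 and 332 GFLOPS, in line with the figures quoted above. These are theoretical peaks; real code will land below them.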

And, while it may be hard to access, even your Intel IGP has surprisingly powerful double precision throughput… 1/4 of single precision. Accessing that may be difficult, though, even through OpenCL, as allanmac noted in an old Intel forum thread.

sm_2x does offer single-precision FMA; you may have been thinking about sm_1x? If I recall correctly, sm_2x does not allow single-GPU debugging; hooks for that were only added in Kepler (sm_3x). My memory is hazy.

Since software support for sm_2x devices is likely to disappear real soon now, I would not want to steer someone in that direction who is just starting out with CUDA.

You’re right! Fermi does have FMA… I mistakenly thought that was introduced by Kepler. But yes, Fermi is painfully outdated tech… part of the reason the old GPU is so cheap despite the DP power. The Quadro 6000 was originally about 5000 Euro when it came out in 2010.

The next best DP-per-Euro ratio card under 300 Euro is probably the GTX 1060 you recommended. That’s also about 250 Euro and gives about 150 DP GFlops.

Still, Undy, I think everyone is in agreement that you’ll be much happier if you can reformulate your math and algorithms to perform well using mostly single precision computation. You’ll still have DP available, but if you minimize its use you’ll get fantastic computational throughput with just single precision. You’ll find that most scientific programmers, from supercomputer down to embedded, try to minimize DP for that reason no matter what their field or technique (PDEs, Monte Carlo, molecular simulation, FEM, BEM, PIC, ray tracing, FFTs, multipole…)

As a personal example, I store tens of millions of polygons of complex geometry of electromagnetic domain boundaries in a data structure that has a double precision “center” for each leaf node of a spatial partitioning structure, but the geometry inside of each node is defined with single precision floating point offsets relative to that center. In my old raytracing GPU code, transformation matrices between objects and camera were double precision, but the geometry within each object’s bounding box was in single precision floating point, again relative to the object’s center. As a side bonus, geometry storage size is halved in both cases.
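
A tiny sketch of that center-plus-offset idea, in one dimension and in plain Python. This is not the actual data structure described above, only an illustration of the principle. Single precision is emulated with the standard `struct` module, and the names `encode`/`decode` and the coordinates are invented for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def encode(coord, center):
    """Store a coordinate as a single-precision offset from a double-precision center."""
    return to_f32(coord - center)


def decode(offset, center):
    """Reconstruct the coordinate in double precision."""
    return center + offset


# Hypothetical vertex far from the origin, inside a small node.
center = 1_000_000.0          # kept in double precision, one per node
vertex = 1_000_000.123456789  # the value we would like to preserve

absolute_f32 = to_f32(vertex)                      # whole coordinate in single precision
relative = decode(encode(vertex, center), center)  # offset in single precision

err_absolute = abs(absolute_f32 - vertex)
err_relative = abs(relative - vertex)
print(f"absolute float32 error: {err_absolute:.3e}")
print(f"center + offset error:  {err_relative:.3e}")
```

The offset is small, so its 24-bit significand is spent on the fractional detail instead of on the large magnitude of the absolute position. That is why the relative form loses far less while using the same four bytes per coordinate.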

Really valuable and useful information. Thank you so much.

As I said, most of my stuff can perfectly be done in SP. Admittedly, I never really paid much attention to this issue before…

The GTX 1070 is actually quite expensive (I can’t find it for less than 400 Euro). The GTX 1060 seems to fit my budget better at the moment.

I see models with one, two, and even three fans. When they crunch numbers, are these cards loud?

I have a very quiet Seasonic X-750 power supply in a huge case (Antec P280) that I have lovingly adorned with super silent high-end fans. Almost inaudible. I expect that I will have to hear it in the future, but I would love to avoid any unnecessary noise, so I dare ask: do you have any specific recommendations for keeping it as silent as possible?
Thanks again.

If memory serves, the GTX 1060 offers around 3.5 TFLOPS single precision, so if your use cases are amenable to the use of single precision, you should see good performance. Keep an eye on memory bandwidth though, that has become a limiting factor for many real-life use cases. FFT is a classical example of a memory bound computation.
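
To put a rough number on the memory bandwidth point, here is a small roofline-style estimate in Python: attainable throughput is the lower of the compute peak and bandwidth × arithmetic intensity. The 3.5 TFLOPS figure is the one mentioned above; the 192 GB/s bandwidth is a spec value I am assuming for the 6 GB GTX 1060, and the SAXPY kernel is just a hypothetical example.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_SP_GFLOPS = 3500.0  # "around 3.5 TFLOPS" as quoted above
BANDWIDTH_GB_S = 192.0   # assumed memory bandwidth, not from this thread

# Hypothetical kernel: y[i] = a * x[i] + y[i] in single precision.
# 2 flops per element; 12 bytes of traffic (read x, read y, write y).
saxpy_intensity = 2 / 12
saxpy_bound = attainable_gflops(PEAK_SP_GFLOPS, BANDWIDTH_GB_S, saxpy_intensity)

# Arithmetic intensity at which the card stops being memory bound.
ridge_point = PEAK_SP_GFLOPS / BANDWIDTH_GB_S

print(f"SAXPY upper bound: {saxpy_bound:.1f} GFLOPS")
print(f"Compute bound above {ridge_point:.1f} flops/byte")
```

Under these assumptions a streaming kernel like SAXPY tops out near 32 GFLOPS, about 1% of the compute peak, and a kernel needs roughly 18 flops per byte of memory traffic before the arithmetic units become the limit.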

There are many different vendor versions of the GTX 1060. As for which is the quietest under full load, I would suggest reading online reviews to find out. Personally I check out Tom’s Hardware and AnandTech, mostly because I have been reading these sites since the 1990s (I even met Tom in person when he had just started out; he sold the site years ago, I think). There may be better review sites now. A GTX 1060 only draws 120W or thereabouts under full load, so unless a vendor screwed up their cooling solution royally, I would not expect particularly high noise levels.

The fan profiles for NVIDIA consumer cards are typically set to favor lower noise at the cost of higher operating temperatures, to the chagrin of those who want to use these cards professionally with a 24/7 duty cycle and would like the fans to crank near 100% all the time.

Personally I would stay away from aggressively vendor overclocked models, as I have doubts that these are fully qualified for compute loads at those clocks. Some moderate vendor overclocking vs NVIDIA reference clocks seems OK, especially for models that have been on the market for a while, like the GTX 1060.

From what I have found, it seems that MSI Gaming X has the best temperature/noise ratio. Everybody’s praising it. The Zotac AMP! seems to be an excellent choice, too. (They are both ~300 Euro in Europe).

I am truly grateful for the help!

  1. If you need max portability, you can also use OpenCL 1.2 (note that OpenCL 2 isn’t supported by NVIDIA cards). The language is pretty low-level, but there are tons of higher-level bindings, including e.g. Boost.Compute. OpenCL itself has a higher-level C++ binding, but I don’t know how widely it’s available.

  2. If you are more interested in DP performance, you may find AMD cards more interesting. Anyway, you plan to develop portable code, so you don’t need to stick with NVIDIA.