Which graphics cards support CUDA and deliver more than 400 FP64 GFlops?


My organization needs to buy 40 GPU cards (preferably Maxwell),
and our budget is around $650 per card.
We are setting up an HPC environment using workstations, and
our applications will require 15-20 TeraFlops in a networked environment.

We already own 3 Quadro K6000 cards, but 400 GFlops FP64 is required per workstation
(in a decentralized manner).
You may suggest alternatives.

The same has been posted on
but Nvidia is unable to help me

I couldn’t figure out how to deliver 400+GF of FP64 perf at $650 or less, using currently available new CUDA-capable GPUs.

However, if you can change your granularity, and accept 1.2+TF at $1950 or less, that should be doable with Titan, Titan Black, or Titan Z.

The Titan product family currently offers the highest DP Flops/$ ratio of CUDA-capable GPUs, AFAIK. Maxwell doesn’t offer anything comparable at the moment.
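
As a quick sanity check on the two price points mentioned above, the cost per FP64 GFlop can be compared directly. This is only a sketch of the arithmetic using the figures quoted in this thread; the helper name `dollars_per_gflop` is made up for illustration.

```python
def dollars_per_gflop(price_usd, fp64_gflops):
    """Cost of one peak FP64 GFlop at a given price (illustrative helper)."""
    return price_usd / fp64_gflops

# Original target: 400 GFlops FP64 for $650
original_target = dollars_per_gflop(650, 400)
# Coarser granularity: 1.2 TFlops FP64 for $1950
coarser_target = dollars_per_gflop(1950, 1200)

print(f"{original_target:.3f} $/GFlop vs {coarser_target:.3f} $/GFlop")
```

Both targets ask for the same price per GFlop; the difference is only in how many GFlops have to be bought in one card.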

My concern is that I want to have CUDA and OpenCL
both usable for FP64 in all the 40 workstations.
I understand that I will require a combination of cards.

AMD’s Radeon series is a bit flashy when it comes to performance/price.
I am getting my required GFlops with the Radeon HD 5850, Radeon HD 5870, Radeon HD 6950 and higher models within my budget.
If I use those, developers will not be able to use CUDA and will have to shift to OpenCL.

The GeForce GTX 590 offers 2488 GFlops FP32 and 311 GFlops FP64, but it is Fermi-based.
I will not be able to use any of the advanced functionality of CUDA and the related dev-kits if that’s the selection.

Why isn’t Nvidia offering any GPU with mid-range FP64 capabilities?

you plan to slot 1 gpu per workstation (only)? do you need 40 workstations per se?

“My concern is that I want to have CUDA and OpenCL
both usable for FP 64 in all the 40 workstations
I understand that I will require a combination of cards”

a) why would you require a combination of cards?
b) how do you plan to use “cuda and openCL”

Yes, we need 40 workstations with GPU (1 or more in each)

The allocated budget is 30k $ for GPU’s in 40 workstations

The required performance is 15-20 TFlops from the network

So as per one of my estimations, the following combination seems to be fine:

10 x Titan Z (1500 each) = 15000 -> 1.3 TFlops each x 10 = 13 TFlops
10 x 780Ti/980 (850 each) = 8500 -> 0.2 TFlops each x 10 = 2 TFlops
20 x 750Ti/970 (300 each) = 6000 -> 0.04 TFlops each x 20 = 0.8 TFlops

Totals to 15.8 TFlops
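
The totals for this mix can be checked with a few lines of Python. This is a minimal sketch that only re-adds the prices and per-card FP64 figures quoted in this post; none of the numbers are measured or taken from vendor specifications.

```python
# (label, quantity, unit price in USD, FP64 TFlops per card), as quoted in the post
cards = [
    ("Titan Z",   10, 1500, 1.3),
    ("780Ti/980", 10,  850, 0.2),
    ("750Ti/970", 20,  300, 0.04),
]

total_cards = sum(qty for _, qty, _, _ in cards)
total_cost = sum(qty * price for _, qty, price, _ in cards)
total_tflops = sum(qty * tflops for _, qty, _, tflops in cards)

print(f"cards: {total_cards}")
print(f"cost:  ${total_cost}")
print(f"FP64:  {total_tflops:.1f} TFlops")
```

With these inputs the mix covers all 40 workstations and stays just under the 30k $ GPU budget.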

CUDA and OpenCL will both be used for scientific development.
Moreover, we require 20 3D-capable GPUs, and the above fits that criterion.

Any better suggestions?

i have to double check, but your flops seem low for the 780ti

i am struggling to determine whether you need 40 workstations, and 15-20TF; or simply 15-20TF; there is a difference

txbob already mentioned 1 measure to note: cost per flop; another measure might be ‘housing cost’ or ‘host cost’ per gpu

to slot 10 gpus, you theoretically might need as few as 3 workstations - there are decent workstation - not even server - motherboards that can slot up to 4 double width cards

moving from 40 workstations to 3, is significant
perhaps better specify host side requirements
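
The host count mentioned above is just a ceiling division. A minimal sketch, assuming every card fits one usable double width slot and every board offers the same number of such slots (the function name is hypothetical):

```python
import math

def hosts_needed(num_gpus, slots_per_board):
    """Smallest number of boards that can hold num_gpus cards."""
    return math.ceil(num_gpus / slots_per_board)

# 10 cards on boards that take 4 double width cards each
print(hosts_needed(10, 4))
```

Power, cooling and card width can reduce the number of usable slots, so treat the result as a lower bound.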

“CUDA and OpenCL will both be used for Scientific Development”
agreed, most gpus that support cuda equally support opencl
but as far as i know, the 2 can not be combined; i need to check on this

We need both 40 workstations (for 40 people to work - host side requirements)
and 15+ TFlops for computation (using rCUDA and in-house development frameworks).

I understand that motherboards as of today support > 8 GPUs
(8 x 1.3 TF = 10.4 TF on a single housing, if used Titan Z)

So our requirements are met by just 13 Titan Z cards and 2 or 3 workstations,
whose GPUs cost 13 x 2000 = 26000, which is within budget.

But we also require 40 consoles. I looked into Nvidia Grid, but it seems infeasible for development.

“I understand that motherboards as of today support 8 GPUs
(8 x 1.3 TF = 10.4 TF on a single housing, if used Titan Z)”

the cost per slot per motherboard rises disproportionately; hence, i suggest shopping around
some of the 4 slot motherboards are far more economical than the 8 slot motherboards
else, if you have cash to flash, let me not stand in your way

also, one of the titans has triple width, if i am not mistaken; this is something to take note of, as the boards generally cater for/ mind double width

“for 40 people to work”


development time should be a fraction of deployment/ service time; hence this should support the argument for a ‘compacted’ cluster: deploying 40 workstations just because initially 40 workstations were needed for development seems illogical, i would think

rather, this should argue for proper (increased) project management and scaling, etc

a number of hypothetical options come to mind with regards to your “40 console” requirement
cuda generally supports remote debugging
i suppose it might even permit remote compilation; directly or indirectly
remote desktop might be a possibility for individual use cases
you likely do not need a gpu for/ when developing; only when compiling/ testing; this also implies some degrees of freedom
lastly, perhaps you can virtualize your cluster/ turn cuda into a service whilst developing

i admit a number of these are rather ‘raw’ suggestions

So is it better to move with the above (again listed below)???
Yup, local compiling/debugging/etc is required

10 x Titan Z
10 x 780Ti/980
20 x 750Ti/970

It’s a hell of an investment, and it is running into trouble because Nvidia is not responding.

FYI, these consoles will in future be part of an HPC environment
(consisting of K80 and Xeon Phi),
a cluster starting with 30 TFlops for $500,000.

Titan Z delivers more than 1.3TF in aggregate, I believe.

Your suggestion is reasonable. My previous comment was based on this statement you made:

“but 400 GFlops FP64 is required per workstation”

which I seem to have misinterpreted. You simply want an aggregate of 15TF available along with compute capability in every workstation. Your suggestion gets you there. You might consider putting a GTX 970 in every workstation, and have 12 of those also include the Titan-Z as an “extra” GPU:

40 x GTX 970 = $12K
12 x Titan Z = $18K

(using your cost figures)

This will give you some additional “consistency” across the cluster. Every workstation will have the GTX 970, so it has the latest compute capability (cc5.2) and a given code targeting the 970 should run anywhere. Then you have 12 “super” workstations that have the Titan Z and deliver most of the aggregate FP64 flops.

Looking at both of its GPUs together, I believe the Titan Z’s DP flops can be up to approximately double that 1.3TF number.


So given the constraints, I believe the above config would increase the delivered aggregate DP flops, and give you the latest compute capability on every workstation, for development consistency.

Sorry you’re having trouble with the other inquiry you made, but perhaps it is not the best portal for a sales-oriented question (and there’s been a holiday period recently). If you are working with a system integrator, such as Supermicro or the like, to help you build your workstations, they should have staff who can answer your questions. If you are doing the work yourself, then this forum is a reasonable place to get opinions. It’s not clear how to maximize your offering unless you define the specific maximization criteria precisely. I believe my suggestion offers more aggregate peak theoretical DP flops than yours does, but that is just one criterion.

And if you don’t desire the development consistency across the cluster, you can further increase your aggregate flops by going to something like:

25 x GTX 970 = $7500
15 x Titan Z = $22500
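
To compare the two configurations side by side, here is a rough sketch that reuses the original poster's per-card figures ($300 and 0.04 TFlops for the GTX 970, $1500 and 1.3 TFlops for the Titan Z). Those inputs are assumptions carried over from earlier in the thread, not measured numbers, and the Titan Z figure may well be conservative; the configuration labels are made up for illustration.

```python
# Per-card figures from the original poster's estimate (assumptions, not measured)
PRICE_USD = {"GTX 970": 300, "Titan Z": 1500}
FP64_TFLOPS = {"GTX 970": 0.04, "Titan Z": 1.3}  # peak theoretical

# Hypothetical labels for the two suggested configurations
configs = {
    "consistent": {"GTX 970": 40, "Titan Z": 12},
    "max flops":  {"GTX 970": 25, "Titan Z": 15},
}

def summarize(counts):
    """Return (total cost in USD, aggregate peak FP64 TFlops) for a card mix."""
    cost = sum(n * PRICE_USD[card] for card, n in counts.items())
    tflops = sum(n * FP64_TFLOPS[card] for card, n in counts.items())
    return cost, tflops

for name, counts in configs.items():
    cost, tflops = summarize(counts)
    print(f"{name}: ${cost}, {tflops:.1f} TFlops")
```

Both mixes land on the same $30K spend; under these assumed figures the second trades the GTX 970 in every workstation for roughly 3 more TFlops of aggregate peak FP64.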

“So is it better to move with the above (again listed below)???”

my point was that a project spends far more time in the field, doing ‘its thing’, than it takes to develop it; you need a hall to house 40 workstations; you need a room to house 3 workstations; if my project is going to stay in the field for 5 years, and it took 6 months/ a year to develop it, i would favour 3 workstations over 40

“FYI, these consoles will in future be a part of HPC environment”

you continuously seem to wish to ‘scatter/ gather’ (the parts of) your hpc cluster, which to me is contrary to hpc design philosophy

rather, i would be of opinion that one should ‘scatter/ gather’ your software or development or developers

one thing about gpus is that they tend to scale/ pilot easily; make use of this

when developing hpc, i do not see 40 developers developing simultaneously on a cluster; i might however conceive of x senior developers, y junior developers, and z project managers/ project leads/ specialists; and all of them working on laptops most of the time
i also do not see them constantly taking the cluster apart and reassembling it

"(consisting of k80 and XeonPhi)
cluster starting with 30 TFlops for 500,000 $ "

it is for reasons such as this that i would find it hard to purchase k80/ phi
$500k == a lot of ‘ordinary’ flops

the reference to double width/ triple width is a reference to how wide or ‘fat’ the gpu card is; hence, how much space it requires to slot
on low end motherboards, you may have plenty of pci slots, but manage to use few, because the gpu cards are ‘fat’ - it’s like the man who needs to buy 2 tickets for the cinema, because he is (very) obese
higher end boards mind this, but generally for double width cards
the particular gpu specs generally note whether the card is single width/ double width/ triple width