I'm building an HPC machine with a multi-GPU setup.
Most of the CUDA-based calculations will be performed in single precision. Probably a bit of deep learning will also be done on the machine. No SLI is required. However, a huge amount of standard RAM is required.
I have spent some time designing this config.
Could you please comment on whether any pieces could be replaced, or whether there are any obvious misconfigurations in this setup?
There are people in these forums that have built ambitious multi-GPU systems similar to yours, so definitely wait for them to chime in. Here are some thoughts of mine you may want to consider:
I assume you are looking for a single-CPU socket workstation configuration. I would suggest a CPU with 40 PCIe lanes, for maximum throughput between the CPU and GPUs.
Since it looks like most of the parallelizable work is to be offloaded to the GPUs, I would suggest going for fast CPU cores (>= 3.4 GHz base clock), rather than a lot of CPU cores, so as not to become limited in the serial portions of your workloads.
High-throughput system memory is always a good idea for an HPC box; since you also want quite a large system memory, I would suggest a CPU with ECC support so the integrity of all that storage is ensured.
The previous items taken together lead to, e.g., the Xeon E5-1650 v4 (~US$630; 6 cores; four-channel DDR4-2400 with up to 76.8 GB/sec theoretical bandwidth).
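For reference, the 76.8 GB/sec figure follows from channels x transfer rate x bus width. Here is a minimal sketch of that arithmetic, assuming the standard 64-bit (8-byte) data bus per DDR4 channel; the function name is just something I made up for illustration:

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/sec (1 GB = 1e9 bytes).

    Assumes each channel moves `bus_bytes` bytes per transfer
    (8 bytes for a standard 64-bit DDR4 channel).
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    # Four-channel DDR4-2400, as on the Xeon E5-1650 v4
    print(f"{peak_bandwidth_gb_s(4, 2400):.1f} GB/sec")
```

This is a theoretical ceiling only; measured bandwidth will be noticeably lower.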
For PSUs, I would suggest 80PLUS Platinum-rated units at this time; they aren’t hugely more expensive than the Gold-rated ones (+ $50 at most) and you will save on electricity over time. If you prefer the EVGA brand, you may want to look at their SuperNOVA P2 1600W. For optimal efficiency and robustness, the sum of the nominal wattage of all system components should be 50%-60% of the nominal wattage of the PSU. The summed wattage of your components seems to be around 1000W, so a 1600W PSU looks about right.
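A quick way to sanity-check PSU sizing against the 50%-60% rule of thumb is sketched below. The 1000W total is the rough estimate from above, not a measured number, and the helper names are hypothetical:

```python
def psu_load_fraction(total_component_watts: float, psu_watts: float) -> float:
    """Fraction of the PSU's nominal wattage used by the components."""
    return total_component_watts / psu_watts


def psu_range_for_target(total_component_watts: float,
                         low: float = 0.50, high: float = 0.60) -> tuple[float, float]:
    """Smallest and largest nominal PSU wattage that keep the load
    between `low` and `high` of the PSU rating."""
    return total_component_watts / high, total_component_watts / low


if __name__ == "__main__":
    total = 1000.0  # rough estimate of summed component wattage
    print(f"load on a 1600W PSU: {psu_load_fraction(total, 1600.0):.1%}")
    smallest, largest = psu_range_for_target(total)
    print(f"PSU range for 50%-60% load: {smallest:.0f}W to {largest:.0f}W")
```

With these numbers a 1600W unit lands at 62.5% load, marginally above the 50%-60% band, which is why I say "about right" rather than ideal.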
[Later:]
It seems that the Core i7-6850K CPU you listed above is basically the consumer part equivalent to the Xeon E5-1650 v4 I suggested. So if you decide to stick with consumer-grade components rather than using workstation-class components, that seems like a good choice.
As best I can tell from internet research, the performance gain from “overclocked” memory is minimal, e.g. 58 GB/sec measured memory bandwidth for a four-channel configuration at DDR4-2400 vs 63 GB/sec at DDR4-3200. That raw difference would dilute to noise level in terms of application-level performance. So simply going with DDR4-2400 may save you some money.
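To put those two measurements in perspective, here is a small sketch comparing the measured gain with the increase in nominal transfer rate. The 58 and 63 GB/sec values are the ones quoted above; the function name is just for illustration:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (0.10 means +10%)."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    measured = relative_gain(58.0, 63.0)     # measured GB/sec, DDR4-2400 vs DDR4-3200
    nominal = relative_gain(2400.0, 3200.0)  # nominal transfer rates in MT/sec
    print(f"measured bandwidth gain: {measured:.1%}")
    print(f"nominal data-rate gain:  {nominal:.1%}")
```

So roughly a third more nominal data rate buys under 9% more measured bandwidth, and applications typically see only a fraction of that.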