CUDA hardware question

Hello all,

I’m putting together a machine learning GPU compute build and I have a couple of questions before I start buying components.

The objective here is the best training performance within my budget.

My setup will include 8x GTX 1080 Tis (the latest generation). My question is regarding motherboard selection.

I am looking at picking between two barebones offerings from Supermicro: the SYS-4028GR-TRT and the SYS-4028GR-TR2.

The difference between the two is the motherboard, specifically the daughterboard that houses the GPUs.

The TR2 has a single root complex: all 8 GPUs hang off one processor through two PCIe switches, each switch on its own x16 link. Since all GPUs sit under the same CPU, they should be able to use local peer-to-peer communication, which I would expect to speed up backprop. But since all 8 GPUs are now fed through only two x16 links from that one processor, I’m speculating that might hinder performance.

The TRT, on the other hand, splits the 8 GPUs between the two processors (connected via QPI) using four switches. In this case, although GPUs 0-3 can communicate peer-to-peer among themselves, they won’t be able to reach GPUs 4-7 without going through QPI, which is slow. But data loading should be faster, since we would now be using four x16 links spread across the two processors.
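(For anyone comparing these two topologies on an actual box: `nvidia-smi topo -m` prints the GPU interconnect matrix, so you can see directly which pairs share a switch and which would cross QPI. The exact legend labels vary by driver version; roughly:)

```shell
# Print the GPU/CPU interconnect matrix.
#   PIX      = GPUs behind the same PCIe switch (fast peer-to-peer path)
#   PHB      = same root complex, through the CPU's PCIe host bridge
#   SOC/SYS  = traffic must cross the QPI link between sockets
nvidia-smi topo -m
```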

Ideally I would be using Teslas and PCIe SSDs with GPUDirect RDMA. In that case a single root complex (and hence the TR2) would be the clear winner. But with the GTX line, GPUDirect is only available for peer-to-peer communication between local GPUs. So potentially a single root complex might slow things down, since all data loading would then happen through one CPU only?
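Either way, once the machine is built, the peer-to-peer question above can be answered empirically rather than by spec sheet. A minimal sketch using the standard CUDA runtime API (no assumptions beyond an installed CUDA toolkit) that queries every GPU pair:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ask the driver, for every ordered GPU pair, whether peer-to-peer
// access is possible. On a dual-root board like the TRT, pairs that
// would have to cross QPI typically report "no"; on a single-root
// board like the TR2, all pairs should report "yes".
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```

A pair reporting "yes" here can then be enabled at runtime with `cudaDeviceEnablePeerAccess`, which is what frameworks use under the hood for direct GPU-to-GPU transfers during all-reduce.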

My question is: which configuration would be superior? I would appreciate some input before taking the (not-so-easy-to-swallow) plunge.

Thank you.


This link might have the answer you are looking for: