4x RTX Titan and NVLink

Hello,

My team is building a new workstation for Deep Learning training, and we’re planning on 4x RTX Titan with a Supermicro MBD-X10DRG-Q motherboard and 2x Intel Xeon E5-2620 v4. My question is regarding the use of NVLink. Since NVLink can only bridge two graphics cards together, how can it work on this system?

I thought it was possible to use 2x NVLink bridges, creating two pairs of bridged cards: let’s say, bridging cards 1 and 2, and cards 3 and 4. But I’ve been told by an NVIDIA Customer Care representative that this setup isn’t possible. Is that correct?

If I can’t use two NVLink bridges at the same time, would it be possible to install only one, bridging cards 1 and 2, and leave the other two cards unbridged?

My main concern is to have, at least, one configuration with 48 GB of GPU memory available (and I say 48 GB available because, as I’ve also been told, NVLink is supposed to “virtualize” two graphics cards together, essentially making them look like one). How can this be achieved?

Our applications will be anything built with Keras, Theano, Caffe, TensorRT, and so on…

Thank you.

Hello,

Yes, the TITAN RTX NVLink bridge only bridges two RTX GPUs together. NVLink won’t virtualize the GPUs; you’ll still see multiple GPUs. You’ll just double the effective GPU memory capacity to 48 GB and get performance scaling from up to 100 GB/s of total data-transfer bandwidth between the bridged pair.
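For illustration, a minimal sketch of what this looks like from a framework’s point of view (assuming TensorFlow 2.x with the Keras API, which is only one of the stacks mentioned above): the bridged cards still enumerate as separate devices, and the combined 48 GB is consumed through data-parallel replication rather than as one pooled memory space.

```python
import tensorflow as tf

# Even with NVLink bridges installed, each Titan RTX shows up as its own
# 24 GB device; nothing is merged into a single 48 GB GPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")  # expect 4 separate /physical_device:GPU:n entries

# Data-parallel training replicates the model on every GPU; the NCCL
# all-reduce used for gradients will route over NVLink where a bridge exists.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```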

How about using two NVLink bridges at the same time, though? Does that even work?

Yes, two NVLink bridges should work: for example, bridging GPUs 1 and 2, and bridging GPUs 3 and 4.

Of course, you’ll need to make sure you have enough PCIe lanes between the two bridges to handle traffic between the GPU {1,2} pair and the {3,4} pair.
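One quick way to confirm what the system actually sees (a minimal sketch; it assumes nvidia-smi is on the PATH and Python 3.7+): the topology matrix marks NVLink-connected pairs as NV#, while pairs that have to cross PCIe or the inter-CPU link show up as PIX/PHB/SYS.

```python
import subprocess

# GPU-to-GPU topology matrix: NV# entries are NVLink-bridged pairs,
# PIX/PHB/SYS entries go over PCIe and/or the inter-CPU link.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Per-link NVLink status and speed for each GPU.
print(subprocess.run(["nvidia-smi", "nvlink", "--status"],
                     capture_output=True, text=True).stdout)
```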

The motherboard I’ve chosen is the Supermicro MBD-X10DRG-Q, which supports dual processors… so I’m going to use two Intel Xeon E5-2620 v4 CPUs. This setup should give me 80 PCIe lanes.

Is that enough?

I have the exact same setup: 4x Titan RTX with two NVLink bridges, each bridging two cards. It works perfectly fine, and I get bidirectional throughput of around 93.xx GB/s.
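For anyone who wants to sanity-check their own numbers: the usual tool is the p2pBandwidthLatencyTest from the CUDA samples, but a rough check can also be done in a few lines, e.g. with PyTorch (a sketch only, not necessarily how the figure above was measured; it assumes PyTorch with CUDA and at least two visible GPUs):

```python
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

# 1 GiB buffers on two GPUs (pick a bridged pair, e.g. cuda:0 and cuda:1).
src = torch.empty(1024 ** 3, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(1024 ** 3, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)                       # warm-up copy; enables peer access if available
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):                  # 10 x 1 GiB device-to-device copies
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

# NVLink-bridged pairs should come out well above what a PCIe-only pair gives.
print(f"~{10 / elapsed:.1f} GiB/s unidirectional")
```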

HOWEVER, one problem I noticed is that the idle temperature of each card goes up by almost 20 degrees Celsius. Before linking the cards it hovers at around 35 degrees; after bridging two cards each, temperatures climb to 55 degrees shortly after.

@Moderator, any idea why that is? Both temperature readings are taken at idle without any workload. Also, air circulation is ample in my ESC8000 G4 GPU compute server chassis.
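A minimal sketch for logging idle temperatures over time, in case it helps narrow this down (it assumes the pynvml NVML bindings are installed; the one-minute polling loop is just an example):

```python
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Poll every 5 seconds for about a minute while the system sits idle.
for _ in range(12):
    temps = [pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
             for h in handles]
    print("GPU temps (C):", temps)
    time.sleep(5)

pynvml.nvmlShutdown()
```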

Remember that multiple CPUs don’t simply add their lanes together; you’ll have a limited connection between the lanes on CPU 1 and CPU 2 (the QPI link). Not sure how that motherboard handles it, but you’ll probably have a bottleneck there, as traffic between the sockets has to go through both CPUs.

You can limit this by making sure the motherboard lets you hook up GPUs 1 and 3 to CPU 1 and GPUs 2 and 4 to CPU 2, with NVLink between 1-2 and 3-4. That way, along with NVIDIA’s routing, the GPUs will communicate with each other and with each CPU using the fastest path available, be it NVLink or PCIe.
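To see how the routing falls out in practice, here is a minimal sketch (assuming PyTorch with CUDA) that prints which GPU pairs get direct peer access; pairs that come back False have to stage transfers through host memory, which on a dual-socket board means crossing the inter-CPU link.

```python
import torch

# Peer-access matrix: True means direct GPU-to-GPU copies are possible
# (over NVLink where bridged, otherwise PCIe); False means transfers are
# staged through host memory and cross the link between the two CPUs.
n = torch.cuda.device_count()
for i in range(n):
    row = ["-" if i == j else torch.cuda.can_device_access_peer(i, j)
           for j in range(n)]
    print(f"GPU {i}: {row}")
```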

Similar issues arise with EPYC/Threadripper, since each die is responsible for half of the PCIe lanes.

How high is the memory bandwidth between the two TITAN RTX GPUs using the NVLink bridge, compared to the bandwidth between the V100 GPUs in the NVIDIA DGX Station?