Programming with NVLINK

If one has a workstation with two Quadros connected via NVLINK (either GV100 or GP100), can the processors of both boards be accessed when launching a kernel from just one of them? I have read much about accessing the memory of one card from the other card, but not the processors. In other words, can the two NVLINK-connected boards be used as a single board with twice the memory and twice the processors, so I don’t have to mess around with setting a different device and doing another kernel launch call?

To the best of my knowledge, currently, no. Whether NVIDIA’s just-announced NVSwitch technology will support building virtual GPUs from multiple devices, as you envision, seems to be an open question.
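Until such a feature exists, splitting work across the two boards requires the usual multi-GPU pattern the original question was hoping to avoid. A minimal sketch, assuming a hypothetical kernel named `scale` and device IDs 0 and 1:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; stands in for whatever work you want to split.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int nPerGpu = 1 << 20;
    float *buf[2];

    // One allocation and one kernel launch per device: the two boards
    // remain separate devices, even with the NVLINK bridges installed.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&buf[dev], nPerGpu * sizeof(float));
        // Kernel launches are asynchronous, so both GPUs run concurrently.
        scale<<<(nPerGpu + 255) / 256, 256>>>(buf[dev], nPerGpu, 2.0f);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(buf[dev]);
    }
    return 0;
}
```

The per-device loop is exactly the "setting a different device and doing another kernel launch call" overhead the question asks about; NVLINK does not remove it.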

That being the case, the value of NVLINK would seem somewhat limited. One can use it if more than 32 GB of memory is needed, or perhaps to synchronize between GPU boards using system-scope atomics, but what else? Am I missing something?

Yes, performance. PCIe gen 3 is a tiny straw compared to the fat pipes of NVLINK 2.0. On a high-end system, GPU memory throughput is in the 400-900 GB/sec range and host memory throughput is 60-100 GB/sec, but in between sits a PCIe gen 3 x16 link limited to about 12 GB/sec.
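The gap is easy to see for yourself by timing a device-to-device copy. A rough sketch, assuming devices 0 and 1 are the bridged pair: with peer access enabled the copy should travel over NVLINK; without it, the copy is staged through host memory and limited by PCIe.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB test buffer
    float *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);    // no-op/error if P2P unsupported
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);

    // Time the peer copy with events on device 0.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.1f GB/sec\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```

On a PCIe-only path the reported figure should sit near the ~12 GB/sec ceiling mentioned above; over NVLINK 2.0 bridges it should be several times higher.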

That can be a bottleneck for certain applications. NVIDIA has published benchmark results from several applications that show a definite benefit from NVLINK due to the ability to shuffle data faster between GPUs, and between GPU and CPU.

I have read the docs and specs, and I believe they are all based on the mezzanine Tesla V100 modules that plug into the NVLINK backplane of an IBM Power system.
However, that is quite a different environment from a workstation with two Quadro GV100 boards and the two NVLINK bridges that fit on top of them. I fail to see how the workstation can enjoy the CPU -> GPU memory speedup often quoted. Where is the CPU -> GPU NVLINK path?
I see how it will speed up GPU -> GPU transfers, but getting data between the CPU and the GPU will still be PCIe-limited.

There is no CPU -> GPU NVLINK path for x86 systems at this time. If enough customers demand it, that may convince AMD and/or Intel to add NVLINK to their CPUs, but I suspect they are too busy defending their turf to embrace that kind of technological progress. I know PCIe gen 4 is in the works, but as far as I am aware it will merely double throughput compared to PCIe gen 3. NVIDIA is obviously interested in working with any company willing to add NVLINK capability, IBM being the first taker.

The way I understand the x86 market of the past five years or so, the strategy of the dominant CPU maker is to partition the remaining performance gains to be had from Moore’s Law into as many tiny incremental hardware updates as possible to maintain their x86 revenue streams for as long as they can.

For x86-based systems, the advantage from NVLINK is in the increased bandwidth (and, best I know, reduced latency) of GPU-to-GPU communication in multi-GPU setups.
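Whether a given pair of boards actually gets that GPU-to-GPU benefit can be queried at runtime. A sketch, assuming devices 0 and 1: the P2P attribute query also reveals whether the link supports native atomics, a capability NVLINK provides and PCIe does not.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0, nativeAtomics = 0;

    // Can device 0 read/write device 1's memory directly?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    // Does the link between them support native atomic operations?
    cudaDeviceGetP2PAttribute(&nativeAtomics,
                              cudaDevP2PAttrNativeAtomicSupported, 0, 1);

    printf("GPU0 -> GPU1 peer access: %s, native atomics: %s\n",
           canAccess ? "yes" : "no", nativeAtomics ? "yes" : "no");
    return 0;
}
```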

Thank you very much for your comments and insights, njuffa. My purpose was to make sure I knew exactly what advantages the Quadro GV100 provides before purchase and I have definitely learned that.

One major advantage as compared to a Titan V is the large amount of on-board memory: 32 GB. In my experience, a large GPU memory is helpful to GPU-accelerated applications much more often than fast links. But as I pointed out previously, there are certainly applications that do benefit from fast links, as highlighted in NVIDIA’s communications.

Whether the Quadro GV100 is cost effective at the price of $9000 that I see reported on multiple websites is an entirely different story, and will depend very much on your use case. But at a power draw of a mere 250W it is certainly an impressive engineering marvel.

Here is the kind of promotional hoohah I came across that prompted me to ask my initial question.
This is a cut & paste from PNY’s GV100 website:
“NVLINK2-2W2S-KIT provides two NVLink connectors for the GV100, effectively fusing two physical boards into one logical entity with 10240 CUDA cores, 1280 Tensor cores, and 64 GB of HBM2 GPU memory.”

Now you can see how my initial confusion arose.

I agree that this seems to imply that the currently available CUDA version (that is, 9.1) would report such a configuration as a single GPU, rather than as two GPUs that require a multi-GPU programming approach to be utilized fully. I am not aware of any such functionality, though of course it may exist without my knowledge. Only NVIDIA can give you an authoritative answer regarding the features of their products.
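It is easy to check what CUDA itself reports. A sketch: on any CUDA version I am aware of, an NVLINK-bridged pair still enumerates as two separate devices, each with its own cores and its own 32 GB of memory, not as the "fused" single entity the marketing copy suggests.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s)\n", count);   // expect 2, not 1

    // Each device reports its own properties independently.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("  device %d: %s, %zu MiB\n", dev, prop.name,
               (size_t)(prop.totalGlobalMem >> 20));
    }
    return 0;
}
```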

The description on the website of Exxact Corp matches my current knowledge about the benefits of using NVLINK bridges with Volta. Interestingly, both PNY and Exxact are listed as official NVIDIA partners.