Fast Multi-GPU collectives with NCCL

Originally published at:

Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to…

Hi Nathan, I've looking forward to this project to come out!
I have a question: If I want to perform a 2D domain decomposition (using a Cartesian domain for simplicity) with NCCL, how would you advice to perform the halo regions exchange using only the 5 collectives in the project?

This is an important library for our work. We use numerous multi-GPU applications. Having a topology-aware library should ease our programming burden.

Does this tool support the P8NVLink system with P100 GPU? From the blog below, this tool need the GPUDirect but OpenPOWER system doesn't have GPUDirect. That is why I ask if this tool is ready for OpenPOWER P8NVLink system with P100 GPU.

Nvidia blog
"NCCL makes extensive use of GPUDirect Peer-to-Peer direct access to push data between processors."

NCCL will automatically fall back to pinned buffers if P2P direct access is not available. That said, the GitHub version of NCCL is not optimized for NVLink, so the achieved performance will be similar to PCIe rather than NVLink bandwidths.

Does NCCL provide benefits when using a heterogeneous set of NVIDIA cards or must they be a homogeneous matched set? For example, will I see benefits if I have a GTX 980ti and GTX 1070 in my machine?

For best performance, NCCL uses peer-to-peer direct access between GPUs. When this feature is not available (e.g., when different models of GPU are used together), it stages transfers through pinned host memory instead. This definitely impacts performance, but should still outperform memcpy-based solutions.

Interesting library, which I find useful for my multi-GPU scheduling library. Are there mechanisms in place that can be used to identify when a particular GPU has received all of the data it needs to begin processing?

NCCL uses CUDA Streams to track dependencies while executing GPU tasks asynchronously with respect to the host thread. When calling NCCL you provide a stream handle into which NCCL enqueues it's kernels. If you use this same stream both to populate the send buffers before the NCCL call and to consume the collected results afterward, the CUDA runtime will ensure that the steps are executed in the correct order.

I am evaluating NCCL in our DL training. Great work!
I run some quick test such as NCCLv1 across 4 P100 (PCIe) in the same node, but I didn't find any d2d/p2p API invoke when profile using NVProf. so any internal APIs used for cross-dev copy? how to profile and monitor such traffic e.g., through nvprof?

I actually post the question with more details in Nvidia tech forum (below URL), but nobody answered, could u help me out? thanks.

To avoid launching interleaved kernels and memcpys, NCCL uses peer direct writes between the GPUs located on a common PCIe root hub (and zero-copy memory when traversing QPI). This means no cudaMemcpy calls will appear in the profile. Instead there will be a single kernel launched on each GPU. To gauge raw NCCL perf, you can use the minimum kernel time from all GPUs (the collectives need all GPUs to be present, so early arrivals spend time spin-waiting for the last device to arrive).

thanks for the reply.
May I confirm that those internal data moving are triggered in the kernel and hence underneath CUDA driver? Is such moving traffic included in nvidia-smi when query PCIe rd/wr? if yes, I guess it is shown in both src.GPU read and dst.GPU write? pls confirm.

In cross-node case, if GPUDirect (RDMA) not available, will it goto CPU via memcpyD2H then go to networking? if with GPUDirect, it could be RDMA (NCCLv2)?

thanks again.

I am not familiar with nvidia-smi's PCIe query. As far as I am aware these are not available on x86 which is the platform I predominantly use. I would expect more "writes" than "reads" since NCCL tries to push rather than pull data.

For multi-node you are correct. GDRDMA is used, but a host-memory fall-back is used when the GDRDMA is not available.

Thanks Nathan for writing the blog.

I have an orthogonal question regarding the method you used to measure the bandwidth. What tool can I use to measure bandwidth between various links on a server with 8 GPUs?

Note: I came across bandwidthTest which is part of the CUDA samples but I would like something that can help look at bandwidth usage for all links over time.

Hi Nathan, what is the profiling tool you used to get the performance link bandwidth achieved by various NCCL collective?