Fast Multi-GPU collectives with NCCL

Originally published at: Fast Multi-GPU collectives with NCCL | NVIDIA Technical Blog

Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to…

Hi Nathan, I've been looking forward to this project coming out!
I have a question: if I want to perform a 2D domain decomposition (using a Cartesian domain for simplicity) with NCCL, how would you advise performing the halo-region exchange using only the five collectives in the project?

This is an important library for our work. We use numerous multi-GPU applications. Having a topology-aware library should ease our programming burden.

Does this tool support the P8 NVLink system with P100 GPUs? According to the blog below, this tool needs GPUDirect, but the OpenPOWER system doesn't have GPUDirect. That is why I ask whether this tool is ready for the OpenPOWER P8 NVLink system with P100 GPUs.

Nvidia blog
=========
https://devblogs.nvidia.com...
"NCCL makes extensive use of GPUDirect Peer-to-Peer direct access to push data between processors."

NCCL will automatically fall back to pinned buffers if P2P direct access is not available. That said, the GitHub version of NCCL is not optimized for NVLink, so the achieved performance will be similar to PCIe rather than NVLink bandwidths.

Does NCCL provide benefits when using a heterogeneous set of NVIDIA cards, or must they be a homogeneous, matched set? For example, will I see benefits if I have a GTX 980 Ti and a GTX 1070 in my machine?

For best performance, NCCL uses peer-to-peer direct access between GPUs. When this feature is not available (e.g., when different models of GPU are used together), it stages transfers through pinned host memory instead. This definitely impacts performance, but should still outperform memcpy-based solutions.
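If you want to check what to expect on a particular box, a small standalone sketch like the one below (it uses only the public CUDA runtime call cudaDeviceCanAccessPeer, not NCCL internals) will report which GPU pairs support peer-to-peer direct access and which would be staged through host memory:

    // Standalone sketch (not NCCL code): report which GPU pairs can use
    // peer-to-peer direct access. Pairs that cannot would be staged through
    // pinned host memory.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int src = 0; src < ndev; ++src) {
            for (int dst = 0; dst < ndev; ++dst) {
                if (src == dst) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, src, dst);
                printf("GPU %d -> GPU %d : %s\n", src, dst,
                       canAccess ? "P2P direct access"
                                 : "staged through pinned host memory");
            }
        }
        return 0;
    }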

Interesting library, which I find useful for my multi-GPU scheduling library. Are there mechanisms in place that can be used to identify when a particular GPU has received all of the data it needs to begin processing?

NCCL uses CUDA streams to track dependencies while executing GPU tasks asynchronously with respect to the host thread. When calling NCCL you provide a stream handle into which NCCL enqueues its kernels. If you use this same stream both to populate the send buffers before the NCCL call and to consume the collected results afterward, the CUDA runtime will ensure that the steps are executed in the correct order.
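As a rough sketch of that pattern (the kernels fill_send and consume_result and the helper run are made up for illustration, and the communicators, device buffers, and per-GPU streams are assumed to have been set up already):

    #include <nccl.h>
    #include <cuda_runtime.h>

    // Hypothetical producer/consumer kernels, for illustration only.
    __global__ void fill_send(float* buf, size_t n) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)gridDim.x * blockDim.x)
            buf[i] = 1.0f;
    }
    __global__ void consume_result(float* buf, size_t n) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)gridDim.x * blockDim.x)
            buf[i] *= 0.5f;
    }

    // One host thread drives all devices; every step for device i is enqueued
    // on streams[i], so the runtime orders: fill -> all-reduce -> consume.
    // (With NCCL 2 you would wrap the loop of collective calls in
    // ncclGroupStart()/ncclGroupEnd().)
    void run(ncclComm_t* comms, float** sendbuf, float** recvbuf,
             cudaStream_t* streams, int ndev, size_t count) {
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            fill_send<<<128, 256, 0, streams[i]>>>(sendbuf[i], count);      // populate
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);                            // collective
            consume_result<<<128, 256, 0, streams[i]>>>(recvbuf[i], count); // consume
        }
        for (int i = 0; i < ndev; ++i) {  // only the host needs an explicit sync
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
        }
    }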

I am evaluating NCCL in our DL training. Great work!
I ran some quick tests, such as NCCL v1 across 4 P100 (PCIe) GPUs in the same node, but I didn't find any D2D/P2P API invocations when profiling with nvprof. So what internal APIs are used for cross-device copies? And how can I profile and monitor such traffic, e.g., through nvprof?

I actually posted the question with more details on the NVIDIA tech forum (URL below), but nobody answered. Could you help me out? Thanks.
https://devtalk.nvidia.com/...

To avoid launching interleaved kernels and memcpys, NCCL uses peer direct writes between the GPUs located on a common PCIe root hub (and zero-copy memory when traversing QPI). This means no cudaMemcpy calls will appear in the profile. Instead there will be a single kernel launched on each GPU. To gauge raw NCCL perf, you can use the minimum kernel time from all GPUs (the collectives need all GPUs to be present, so early arrivals spend time spin-waiting for the last device to arrive).
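If you want numbers without nvprof, one option is to bracket the collective with CUDA events on each GPU's stream and take the minimum elapsed time. The sketch below assumes the communicators, buffers, and streams already exist and that there are at most 8 devices; time_allreduce is just an illustrative name:

    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cfloat>

    // Sketch: time one all-reduce per GPU with CUDA events and report the
    // minimum, since early arrivals spin-wait for the last device.
    float time_allreduce(ncclComm_t* comms, float** sendbuf, float** recvbuf,
                         cudaStream_t* streams, int ndev, size_t count) {
        cudaEvent_t start[8], stop[8];       // assumes ndev <= 8
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaEventCreate(&start[i]);
            cudaEventCreate(&stop[i]);
            cudaEventRecord(start[i], streams[i]);
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
            cudaEventRecord(stop[i], streams[i]);
        }
        float best_ms = FLT_MAX;
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaEventSynchronize(stop[i]);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start[i], stop[i]);
            if (ms < best_ms) best_ms = ms;  // fastest GPU ~ raw collective time
        }
        printf("min collective kernel time: %.3f ms\n", best_ms);
        return best_ms;
    }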

Thanks for the reply.
May I confirm that those internal data movements are triggered inside the kernel and hence happen beneath the CUDA driver? Is such traffic included when nvidia-smi queries PCIe reads/writes? If so, I guess it would show up as both a read on the source GPU and a write on the destination GPU? Please confirm.

In the cross-node case, if GPUDirect RDMA is not available, will the data go to the CPU via a device-to-host memcpy and then out over the network? And with GPUDirect, could it go over RDMA (NCCL v2)?

Thanks again.

I am not familiar with nvidia-smi's PCIe queries. As far as I am aware, they are not available on x86, which is the platform I predominantly use. I would expect more "writes" than "reads", since NCCL tries to push rather than pull data.

For multi-node you are correct: GDRDMA is used, with a host-memory fallback when GDRDMA is not available.

Thanks Nathan for writing the blog.

I have an orthogonal question regarding the method you used to measure bandwidth. What tool can I use to measure the bandwidth of the various links on a server with 8 GPUs?

Note: I came across bandwidthTest, which is part of the CUDA samples, but I would like something that can show bandwidth usage for all links over time.

Hi Nathan, what profiling tool did you use to measure the link bandwidth achieved by the various NCCL collectives?