Massively Scale Your Deep Learning Training with NCCL 2.4

jwitsoe · February 4, 2019, 3:08pm

Originally published at: Massively Scale Your Deep Learning Training with NCCL 2.4 | NVIDIA Technical Blog

Imagine using tens of thousands of GPUs to train your neural network. Training neural networks on multiple GPUs has become quite common, with all deep learning frameworks providing optimized multi-GPU and multi-machine training. Allreduce operations, used to sum gradients over multiple GPUs, have usually been implemented using rings [1] [2] to achieve full bandwidth. The…
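For readers unfamiliar with ring allreduce, here is a minimal single-process sketch of one common textbook formulation: a reduce-scatter pass around the ring followed by an all-gather pass. It is not taken from the article and is not NCCL's actual implementation; the function name `ring_allreduce` and the list-of-lists representation of per-GPU buffers are assumptions made purely for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a sum-allreduce over a ring of ranks (hypothetical helper).

    buffers: one equal-length list of numbers per simulated rank.
    Returns a new list of per-rank buffers, each holding the element-wise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    data = [list(b) for b in buffers]  # copy so the inputs are left untouched

    # Split every buffer into n contiguous chunks; chunk c is bounds[c]:bounds[c+1].
    bounds = [(size * c) // n for c in range(n + 1)]

    def chunk(c):
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) to rank r + 1,
    # which adds it to its own copy. After n - 1 steps, rank r holds the fully
    # summed chunk (r + 1) mod n.
    for s in range(n - 1):
        msgs = [(r, r - s, [data[r][i] for i in chunk(r - s)]) for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) to rank r + 1,
    # which overwrites its copy. After n - 1 steps, every rank holds every summed chunk.
    for s in range(n - 1):
        msgs = [(r, r + 1 - s, [data[r][i] for i in chunk(r + 1 - s)]) for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, buf in enumerate(ring_allreduce(grads)):
        print(rank, buf)
```

In this formulation each rank sends and receives roughly 2(n-1)/n times the buffer size in total, regardless of the number of ranks n, which is why rings are usually associated with full bandwidth; the number of steps, however, grows linearly with n.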

anon56045503 · February 8, 2020, 3:01am

I wonder whether the Flat Ring is composed of a reduce-scatter followed by an all-gather, or just a reduce followed by a broadcast?
