Massively Scale Your Deep Learning Training with NCCL 2.4

Originally published at: https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/

Imagine using tens of thousands of GPUs to train your neural network. Using multiple GPUs to train neural networks has become quite common, with all deep learning frameworks providing optimized multi-GPU and multi-machine training. Allreduce operations, used to sum gradients over multiple GPUs, have usually been implemented using rings [1] [2] to achieve full bandwidth. The…

I wonder whether the flat ring allreduce is composed of a reduce-scatter followed by an all-gather, or just a reduce followed by a broadcast?
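To make the first option concrete, here is a minimal pure-Python simulation of the textbook bandwidth-optimal ring allreduce, written as a reduce-scatter phase followed by an all-gather phase. It is only a sketch of the general technique and makes no claim about how NCCL's kernels are actually written. The function name `ring_allreduce`, the chunking scheme, and the requirement that the buffer length be divisible by the number of ranks are all assumptions made for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a sum-allreduce over a ring of len(buffers) ranks.

    buffers[r] is rank r's local vector. For this sketch (an assumption,
    not a property of any real library) the vector length must be
    divisible by the number of ranks.
    Returns (per-rank results, number of elements each rank sent).
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched
    sent_per_rank = 0

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all payloads first so every rank "sends" simultaneously.
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v
        sent_per_rank += chunk
    # Now rank r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod n
    # to rank r + 1, which overwrites its stale copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
        sent_per_rank += chunk

    return data, sent_per_rank


if __name__ == "__main__":
    grads = [
        [1, 2, 3, 4, 5, 6],
        [10, 20, 30, 40, 50, 60],
        [100, 200, 300, 400, 500, 600],
    ]
    result, sent = ring_allreduce(grads)
    print(result[0], sent)
```

In this scheme each rank sends 2(n-1) chunks, that is 2(n-1)/n times the buffer length, regardless of how many ranks are in the ring. A plain reduce followed by a broadcast around the same ring would instead forward the whole buffer at every hop unless it is pipelined in small pieces, which is why the reduce-scatter plus all-gather form is the one usually meant by "full bandwidth" ring allreduce.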