NCCL allreduce in a high performance DGX A100 cluster

Where can I find conceptual details of how allreduce, reduce-scatter and allgather happen in a cluster of DGX A100 systems? Do they happen in a hierarchical way? Is there any tool available to trace the steps?

NCCL evaluates the cluster topology and runs some tests to determine how it will communicate. Some methods are hierarchical, like tree, and some are not, or are less hierarchical, like ring. From here:

We now have up to 9 choices for Algorithm x Protocol ({Ring,Tree,CollNet}x{LL,LL128,Simple}) so we have models of each combination and for each size, we estimate how much time each would take, then take the lowest.
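
To make the quoted selection step concrete, here is a minimal sketch of "model each combination, estimate the time for a given size, take the lowest". It is not NCCL's tuning code. The latency and bandwidth numbers, the simple latency-plus-size/bandwidth formula, and the names `estimate_time_us` and `choose` are all made up for illustration.

```python
from itertools import product

ALGOS = ("Ring", "Tree", "CollNet")
PROTOS = ("LL", "LL128", "Simple")

# Hypothetical model parameters, NOT taken from NCCL.
BASE_LATENCY_US = {"Ring": 20.0, "Tree": 8.0, "CollNet": 12.0}
BASE_BW_GBPS = {"Ring": 200.0, "Tree": 160.0, "CollNet": 180.0}
PROTO_LATENCY_SCALE = {"LL": 0.5, "LL128": 0.7, "Simple": 1.0}
PROTO_BW_SCALE = {"LL": 0.25, "LL128": 0.9, "Simple": 1.0}


def estimate_time_us(algo, proto, nbytes):
    """Toy estimate: fixed latency plus transfer time at a fixed bandwidth."""
    latency_us = BASE_LATENCY_US[algo] * PROTO_LATENCY_SCALE[proto]
    bw_gbps = BASE_BW_GBPS[algo] * PROTO_BW_SCALE[proto]
    # 1 GB/s equals 1e3 bytes per microsecond.
    return latency_us + nbytes / (bw_gbps * 1e3)


def choose(nbytes):
    """Return the (algorithm, protocol) pair with the lowest estimated time."""
    return min(product(ALGOS, PROTOS),
               key=lambda ap: estimate_time_us(ap[0], ap[1], nbytes))


if __name__ == "__main__":
    for size in (8, 64 * 1024, 1 << 30):
        print(size, choose(size))
```

With these invented numbers the low-latency combination wins for tiny messages and the high-bandwidth one wins for very large messages. That is the general shape of the trade-off, but the actual crossover points depend on the real topology and on NCCL's own models.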

I don’t know of “conceptual” documentation, but there is a GTC presentation.

Also, NCCL is open source.

If you use a profiler, you can see what NCCL is doing, to some degree.
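
For a conceptual picture of the ring case mentioned above: a ring allreduce is commonly described as a reduce-scatter phase followed by an allgather phase, each taking n-1 steps for n ranks. The sketch below simulates that in plain Python with one number per chunk. It only illustrates the data movement and is not NCCL code. The function name `ring_allreduce` and the chunk indexing are my own choices.

```python
def ring_allreduce(buffers):
    """Simulate a sum allreduce over n ranks arranged in a ring.

    buffers[r] is rank r's input, split into n chunks (one number per chunk).
    Returns the per-rank buffers after the operation.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]
    assert all(len(b) == n for b in data), "each rank needs n chunks"

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, c, value in sends:
            data[(r + 1) % n][c] += value
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: allgather. In step s, rank r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, value in sends:
            data[(r + 1) % n][c] = value

    return data


if __name__ == "__main__":
    result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    for rank, buf in enumerate(result):
        print(rank, buf)
```

Every rank ends up with the same element-wise sum. Stopping after phase 1 corresponds to a reduce-scatter, and running only phase 2 on already distributed chunks corresponds to an allgather.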
