I am working with a DGX SuperPOD based on NVIDIA DGX H100 systems, characterizing collective run times as part of analyzing AI workloads. I am specifically interested in the all-reduce, all-gather, all-to-all and reduce-scatter operations. Can you please explain the meaning of “size” as reported by the NCCL tests for each collective? Should we also take the “count (elements)” field into account when comparing approaches for implementing collectives?
As an example, in the case of all-reduce we have data on all the GPUs that needs to be reduced, and the reduction result is then broadcast to all GPUs. So is “size” the size of the data per GPU (which could in principle differ between GPUs) or the total size over all GPUs? Is the meaning different for the other collectives? Here is an example output for all-to-all on 2 nodes (16 GPUs):
“Size” is the number of GPUs times the number of elements per GPU times the number of bytes per element. You are moving all the data to all the nodes, so the size of the transfer is that product. It should match the description of the equivalent MPI collective.
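For concreteness, here is a minimal sketch of that product for the 2-node, 16-GPU case in the question; the element count and data type are made-up values, not taken from any actual run:

```c
/* Sketch only: the "size" product described above, for a hypothetical
 * 16-GPU all-to-all with 1M float elements per GPU (made-up numbers). */
#include <stdio.h>

int main(void) {
    long long nranks   = 16;        /* 2 nodes x 8 GPUs */
    long long count    = 1000000;   /* elements per GPU (assumed) */
    long long typesize = 4;         /* bytes per element, e.g. float */

    long long size_bytes = nranks * count * typesize;
    printf("size = %lld bytes\n", size_bytes);   /* 64,000,000 bytes */
    return 0;
}
```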
Looking more closely at the nccl-tests code, this is how they compute the bandwidth:
- All-gather, all-to-all, reduce-scatter: `double baseBw = (double)(count * typesize * nranks) / 1.0E9 / sec;`
- All-reduce: `double baseBw = (double)(count * typesize) / 1.0E9 / sec;`
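To see the difference in the numerators side by side, here is a self-contained sketch; only the two formulas are taken from the nccl-tests source, while the function names, inputs, and numbers are illustrative:

```c
/* Sketch: the two baseBw formulas quoted above, side by side.
 * Only the formulas come from nccl-tests; everything else is illustrative. */
#include <stddef.h>
#include <stdio.h>

/* all-gather / all-to-all / reduce-scatter: numerator spans all ranks */
static double base_bw_all_ranks(size_t count, int typesize, int nranks, double sec) {
    return (double)(count * typesize * nranks) / 1.0E9 / sec;
}

/* all-reduce: numerator is count * typesize only */
static double base_bw_one_rank(size_t count, int typesize, double sec) {
    return (double)(count * typesize) / 1.0E9 / sec;
}

int main(void) {
    size_t count = 1000000;  /* elements (made up) */
    int typesize = 4;        /* bytes per element, e.g. float */
    int nranks   = 16;
    double sec   = 1.0e-3;   /* measured time in seconds (made up) */

    printf("all-gather-style baseBw: %.1f GB/s\n",
           base_bw_all_ranks(count, typesize, nranks, sec));
    printf("all-reduce baseBw:       %.1f GB/s\n",
           base_bw_one_rank(count, typesize, sec));
    return 0;
}
```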
- Is it correct that `count * typesize` is the size of the data for a single GPU?
- If this is the case, does it mean that for all-gather, all-to-all and reduce-scatter “size” is the total data across all GPUs, whereas for all-reduce “size” means the data on a single GPU?
Do I understand correctly, then, that all-reduce indeed behaves differently? That is, “size” is per single GPU for all-reduce, whereas it is the total size over all GPUs for all-gather, reduce-scatter and all-to-all?
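To make clear why the distinction matters for my comparison, here is a small illustration of the two readings for one hypothetical row (the size value and rank count are made up):

```c
/* Illustration of the two readings I am trying to distinguish,
 * for a hypothetical row with size = 64,000,000 bytes and 16 ranks. */
#include <stdio.h>

int main(void) {
    long long size_bytes = 64000000;   /* value from a hypothetical "size" column */
    long long nranks     = 16;

    /* reading 1: "size" already is the per-GPU buffer */
    long long per_gpu_if_per_rank = size_bytes;            /* 64,000,000 bytes per GPU */

    /* reading 2: "size" is the aggregate over all GPUs */
    long long per_gpu_if_total    = size_bytes / nranks;   /*  4,000,000 bytes per GPU */

    printf("per-GPU buffer if size is per rank: %lld bytes\n", per_gpu_if_per_rank);
    printf("per-GPU buffer if size is total:    %lld bytes\n", per_gpu_if_total);
    return 0;
}
```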