I am working with a DGX SuperPOD based on NVIDIA DGX H100 systems, characterizing collective run times as part of analyzing AI workloads. I am specifically interested in the all-reduce, all-gather, all-to-all and reduce-scatter operations. Can you please explain the meaning of “size” as reported by the NCCL tests for each collective? Should we also take the “count (elements)” field into account when comparing approaches for implementing collectives?
As an example, in the case of all-reduce we have data on all the GPUs that needs to be reduced, and the reduction result is then broadcast to all GPUs. So is “size” the size of the data per GPU (which could in principle differ between GPUs) or the total size over all GPUs? Is the meaning different for the other collectives? Here is an example output for all-to-all on 2 nodes (16 GPUs):
“Size” is the number of GPUs times the number of elements per GPU times the number of bytes per element. You are moving all the data to all the nodes, so the size of the transfer is that product. It should match the description of the equivalent MPI collective.
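For concreteness, here is a minimal sketch of that product for the 2-node, 16-GPU case in the question; the element count and data type are made-up values, not taken from any actual run:

```c
/* Sketch only: the "size" product described above, for a hypothetical
 * 16-GPU all-to-all with 1M float elements per GPU (made-up numbers). */
#include <stdio.h>

int main(void) {
    long long nranks   = 16;        /* 2 nodes x 8 GPUs */
    long long count    = 1000000;   /* elements per GPU (assumed) */
    long long typesize = 4;         /* bytes per element, e.g. float */

    long long size_bytes = nranks * count * typesize;
    printf("size = %lld bytes\n", size_bytes);   /* 64,000,000 bytes */
    return 0;
}
```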
Looking more closely at the nccl-tests code, this is how they compute the bandwidth:
- All-gather, all-to-all, reduce-scatter: `double baseBw = (double)(count * typesize * nranks) / 1.0E9 / sec;`
- All-reduce: `double baseBw = (double)(count * typesize) / 1.0E9 / sec;`
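To see the difference in the numerators side by side, here is a self-contained sketch; only the two formulas are taken from the nccl-tests source, while the function names, inputs, and numbers are illustrative:

```c
/* Sketch: the two baseBw formulas quoted above, side by side.
 * Only the formulas come from nccl-tests; everything else is illustrative. */
#include <stddef.h>
#include <stdio.h>

/* all-gather / all-to-all / reduce-scatter: numerator spans all ranks */
static double base_bw_all_ranks(size_t count, int typesize, int nranks, double sec) {
    return (double)(count * typesize * nranks) / 1.0E9 / sec;
}

/* all-reduce: numerator is count * typesize only */
static double base_bw_one_rank(size_t count, int typesize, double sec) {
    return (double)(count * typesize) / 1.0E9 / sec;
}

int main(void) {
    size_t count = 1000000;  /* elements (made up) */
    int typesize = 4;        /* bytes per element, e.g. float */
    int nranks   = 16;
    double sec   = 1.0e-3;   /* measured time in seconds (made up) */

    printf("all-gather-style baseBw: %.1f GB/s\n",
           base_bw_all_ranks(count, typesize, nranks, sec));
    printf("all-reduce baseBw:       %.1f GB/s\n",
           base_bw_one_rank(count, typesize, sec));
    return 0;
}
```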
- Is it correct that `count * typesize` is the size of the data for a single GPU?
- If this is the case, does it mean that for all-gather, all-to-all and reduce-scatter “size” is the total data across all GPUs, whereas for all-reduce “size” means the data on a single GPU?
Do I understand correctly, then, that all-reduce indeed behaves differently? That is, “size” is per single GPU for all-reduce, whereas it is the total size over all GPUs for all-gather, reduce-scatter and all-to-all?
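To make clear why the distinction matters for my comparison, here is a small illustration of the two readings for one hypothetical row (the size value and rank count are made up):

```c
/* Illustration of the two readings I am trying to distinguish,
 * for a hypothetical row with size = 64,000,000 bytes and 16 ranks. */
#include <stdio.h>

int main(void) {
    long long size_bytes = 64000000;   /* value from a hypothetical "size" column */
    long long nranks     = 16;

    /* reading 1: "size" already is the per-GPU buffer */
    long long per_gpu_if_per_rank = size_bytes;            /* 64,000,000 bytes per GPU */

    /* reading 2: "size" is the aggregate over all GPUs */
    long long per_gpu_if_total    = size_bytes / nranks;   /*  4,000,000 bytes per GPU */

    printf("per-GPU buffer if size is per rank: %lld bytes\n", per_gpu_if_per_rank);
    printf("per-GPU buffer if size is total:    %lld bytes\n", per_gpu_if_total);
    return 0;
}
```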