Inquiry about NCCL Broadcast Operation Implementation

user80731 · October 16, 2024, 9:17am

Hi! I am curious about the underlying mechanism of the broadcast operation in NCCL. I have considered two possible implementation approaches:

1. The source node sends messages one-to-one to all destination nodes, effectively achieving the broadcast. In this case, the source would need to send n-1 messages.
2. Utilizing special features of the NVSwitch, where the source only needs to send one broadcast message, and the NVSwitch, acting like a hub, directly broadcasts the message to all destination nodes. In this scenario, the source would only need to send one message.

I have searched for information using keywords such as “nccl broadcast” and “nvswitch broadcast”, but I haven’t been able to find specific technical documentation addressing this. If possible, could anyone provide some detailed technical information on this?

Robert_Crovella · October 16, 2024, 1:58pm

I’m fairly sure it is case 1.

Not all systems that NCCL is usable on have NVSwitch.

NCCL is open source. Furthermore, nsys (the profiler) has specific support for NCCL profiling.

user80731 · October 17, 2024, 5:22am

Thank you so much for your response and guidance. I will look into the NCCL source code to find the answers. One more question regarding NVSwitch: does it have flooding or broadcasting capabilities similar to a layer 2 switch? Such a feature seems like it could improve broadcast performance. I suspect the answer to this question is not in the NCCL source code. Many thanks in advance!

Robert_Crovella · October 17, 2024, 9:00pm

I don’t think NVSwitch behavior is documented to that level.

I should have probably also mentioned that NCCL may use tree-like communication patterns in some cases, and possibly ring-like communication patterns. Therefore a broadcast might be implemented as a set of point-to-point operations, although not all operations may have the same starting point.
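
To make the difference between these patterns concrete, here is a small toy model in Python. It is only a sketch and not NCCL's actual algorithm: the function names (`flat_schedule`, `chain_schedule`, `binary_tree_schedule`, `sends_per_rank`) are made up for illustration, and it ignores chunking, pipelining, and topology. It lists the point-to-point sends each pattern would need and counts how many originate at the root.

```python
from collections import Counter


def flat_schedule(n, root=0):
    """Case 1 from the question: the root sends directly to every other rank."""
    return [(root, r) for r in range(n) if r != root]


def chain_schedule(n, root=0):
    """Ring-like pattern: each rank forwards the data to its next neighbour."""
    order = [(root + i) % n for i in range(n)]
    return list(zip(order, order[1:]))


def binary_tree_schedule(n, root=0):
    """Tree-like pattern: the rank at position i forwards to positions 2i+1 and 2i+2."""
    order = [(root + i) % n for i in range(n)]
    sends = []
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                sends.append((order[i], order[child]))
    return sends


def sends_per_rank(schedule):
    """Count how many sends each rank originates in a schedule of (src, dst) pairs."""
    return Counter(src for src, _ in schedule)


if __name__ == "__main__":
    n = 8
    for name, build in [("flat", flat_schedule),
                        ("chain", chain_schedule),
                        ("binary tree", binary_tree_schedule)]:
        schedule = build(n)
        root_sends = sends_per_rank(schedule)[0]
        print(f"{name}: total sends = {len(schedule)}, sends from root = {root_sends}")
```

In every pattern the total number of point-to-point sends is n-1, since each non-root rank has to receive the data once. What changes is where the sends start: in the flat pattern all n-1 come from the root, while in the chain and tree patterns most of them start at intermediate ranks.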
