Hi! I am curious about the underlying mechanism of the broadcast operation in NCCL. I have considered two possible implementation approaches:
- The source node sends messages one-to-one to all destination nodes, effectively achieving the broadcast. In this case, the source would need to send n-1 messages.
- Utilizing special features of the NVSwitch, where the source only needs to send one broadcast message, and the NVSwitch, acting like a hub, directly broadcasts the message to all destination nodes. In this scenario, the source would only need to send one message.
I have searched for information using keywords such as “nccl broadcast” and “nvswitch broadcast”, but I haven’t been able to find specific technical documentation addressing this. If possible, could anyone provide some detailed technical information on this?
I’m fairly sure it is case 1.
Not all systems that NCCL is usable on have NVSwitch.
NCCL is open source, so you can check the implementation directly. Furthermore, nsys (the Nsight Systems profiler) has specific support for NCCL profiling.
Thank you so much for your response and guidance. I will look into the NCCL source code to find the answers. One more question regarding NVSwitch: does it have flooding or broadcasting capabilities similar to a layer 2 switch? It seems that such a feature could improve broadcast performance. I suspect the answer to this question cannot be found in the NCCL source code. Many thanks in advance!
I don’t think NVSwitch behavior is documented to that level.
I probably should also have mentioned that NCCL may use tree-like communication patterns in some cases, and possibly ring-like communication patterns. Therefore a broadcast might be implemented as a set of point-to-point operations, although those operations do not necessarily all start from the same node.
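To make the difference between these patterns concrete, here is a minimal Python sketch that builds the point-to-point sends for a flat broadcast (case 1 from the question), a ring-like chain, and a binary tree, and then counts how many sends each rank issues. This is only an illustrative model and not NCCL's actual implementation: the function names (`flat_schedule`, `chain_schedule`, `tree_schedule`, `delivers_to_all`) are hypothetical, and it ignores chunking, pipelining, and topology detection.

```python
from collections import Counter


def flat_schedule(n, root=0):
    """Case 1: the root sends directly to every other rank."""
    return [(root, dst) for dst in range(n) if dst != root]


def chain_schedule(n, root=0):
    """Ring-like pattern: each rank forwards to its successor.

    The last rank in the chain does not send back to the root.
    """
    return [((root + i) % n, (root + i + 1) % n) for i in range(n - 1)]


def tree_schedule(n, root=0):
    """Binary tree: ranks are relabelled so that the root is node 0."""
    sends = []
    for child in range(1, n):
        parent = (child - 1) // 2  # parent index in the relabelled tree
        sends.append(((parent + root) % n, (child + root) % n))
    return sends


def delivers_to_all(schedule, n, root=0):
    """Check that every sender already holds the data and all ranks receive it."""
    have = {root}
    for src, dst in schedule:
        if src not in have:
            return False
        have.add(dst)
    return have == set(range(n))


def sends_per_rank(schedule):
    """Count how many point-to-point sends each rank issues."""
    return Counter(src for src, _ in schedule)


if __name__ == "__main__":
    n = 8
    for name, build in [("flat", flat_schedule),
                        ("chain", chain_schedule),
                        ("tree", tree_schedule)]:
        schedule = build(n)
        counts = sends_per_rank(schedule)
        print(name, "total sends:", len(schedule),
              "max sends by one rank:", max(counts.values()))
```

In all three cases the total number of messages is n-1; what changes is who sends them. In the flat pattern the root issues every send, while in the chain and tree patterns most sends originate from ranks other than the root.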