I have a question regarding SHARP and NCCL collectives. Is it possible to run multiple streaming aggregations simultaneously? Specifically, I am trying to run Allgather and Reduce-scatter collectives simultaneously using SHARP Streaming Aggregation.
According to the NVIDIA SHARP documentation, SHARP Streaming Aggregation can be executed on a single NCCL communicator/process group:
NCCL SHARP Streaming aggregation is supported on a single NCCL communicator/process group (PG). Applications can selectively enable SHARP on specific Process Group (PG) by setting this variable in the application before creating the PG.
If I run both Allgather and Reduce-Scatter on the same NCCL communicator, is overlapping these operations possible? Any insights or comments would be greatly appreciated!
When you mention that “Allgather does not use SHARP resources,” are you referring to resources related to data reduction, such as the aggregation logic or hardware?
To further clarify my understanding, if I run Allgather and Reduce-Scatter of size S with N GPUs simultaneously using SHARP, here is what I expect:
| Operation with SHARP | Data sent by each GPU | Data received by each GPU |
|---|---|---|
| Allgather | S | (N-1)S |
| Reduce-Scatter | (N-1)S | S |
| Overlap (simultaneous) | NS | NS |
If these operations overlap, I believe each GPU would need to handle a total data transfer of NS for both sending and receiving simultaneously. My goal is to achieve this. Could this lead to any potential conflicts?
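The per-GPU volumes above can be checked with a short sketch. It only encodes the traffic model stated in this post (S is the per-operation size, N the number of GPUs); it does not call NCCL or SHARP and says nothing about whether the fabric can sustain the overlap. The names `Traffic`, `allgather_traffic`, `reduce_scatter_traffic`, and `overlap_traffic` are hypothetical helpers made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traffic:
    """Bytes sent and received by one GPU for one collective."""
    sent: int
    received: int

    def __add__(self, other: "Traffic") -> "Traffic":
        # Overlapping two collectives adds their per-GPU volumes.
        return Traffic(self.sent + other.sent, self.received + other.received)


def allgather_traffic(n: int, s: int) -> Traffic:
    # Assumed model: each GPU sends its own S once and receives the other N-1 pieces.
    return Traffic(sent=s, received=(n - 1) * s)


def reduce_scatter_traffic(n: int, s: int) -> Traffic:
    # Assumed model: each GPU sends N-1 pieces for reduction and receives one reduced S.
    return Traffic(sent=(n - 1) * s, received=s)


def overlap_traffic(n: int, s: int) -> Traffic:
    # Both collectives in flight at once: totals come to N*S in each direction.
    return allgather_traffic(n, s) + reduce_scatter_traffic(n, s)


if __name__ == "__main__":
    n_gpus, size_bytes = 8, 1 << 20  # example values, not taken from a real run
    print(overlap_traffic(n_gpus, size_bytes))
```

For any N and S the overlapped case sums to NS sent and NS received per GPU, which matches the last row of the table.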
It depends on how many GPUs and nodes you have, the IB switch type, the link topology, how many IB HCAs each node has, how the HCAs map to GPUs, and how your code is designed.
For example:
If you use 64-port NDR switches with rail-to-rail links and a 1:1 GPU:HCA mapping, then each GPU rank can create one broadcast communicator (SAT), and your code can reuse it for any collective, whether allreduce or allgather.
If you have this kind of design requirement, I suggest contacting the NVIDIA Solution Engineering team to help review the design.