[Question] Is It Possible to Overlap Allgather and ReduceScatter Using SHARP Streaming Aggregation?

Hello,

I have a question regarding SHARP and NCCL collectives. Is it possible to run multiple streaming aggregations at the same time? Specifically, I am trying to run Allgather and Reduce-Scatter collectives simultaneously using SHARP Streaming Aggregation.

According to the NVIDIA SHARP documentation, SHARP Streaming Aggregation can be executed on a single NCCL communicator/process group:

> NCCL SHARP Streaming aggregation is supported on a single NCCL communicator/process group (PG). Applications can selectively enable SHARP on specific Process Group (PG) by setting this variable in the application before creating the PG.

If I run both Allgather and Reduce-Scatter on the same NCCL communicator, is overlapping these operations possible? Any insights or comments would be greatly appreciated!
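
For concreteness, here is a minimal sketch of what I am trying to do (PyTorch, launched with torchrun; the tensor sizes are arbitrary, and whether the two calls actually overlap on a single communicator is exactly my question):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

N = dist.get_world_size()
S = 1 << 20  # elements per shard (hypothetical size)

shard = torch.randn(S, device="cuda")          # Allgather input
gathered = torch.empty(N * S, device="cuda")   # Allgather output
full = torch.randn(N * S, device="cuda")       # Reduce-Scatter input
reduced = torch.empty(S, device="cuda")        # Reduce-Scatter output

# Launch both collectives on the same (default) process group without
# waiting in between, hoping they run concurrently.
ag = dist.all_gather_into_tensor(gathered, shard, async_op=True)
rs = dist.reduce_scatter_tensor(reduced, full, async_op=True)
ag.wait()
rs.wait()
```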

Thank you!

Yes, because Allgather does not use SHARP resources.
Which means that while a SHARP allreduce is running, you can run any other job alongside it, but only as normal (non-SHARP) NCCL traffic.

Thank you for the clarification.

When you mention that “Allgather does not use SHARP resources,” are you referring to resources related to data reduction, such as the aggregation logic or hardware?

To further clarify my understanding, if I run Allgather and Reduce-Scatter of size S with N GPUs simultaneously using SHARP, here is what I expect:

| Operation with SHARP | Data sent by each GPU | Data received by each GPU |
|---|---|---|
| Allgather | S | (N-1)S |
| Reduce-Scatter | (N-1)S | S |
| Overlap (simultaneous) | NS | NS |

If these operations overlap, I believe each GPU would need to handle a total data transfer of NS in each direction (send and receive) at the same time; that is what I am aiming for. Could this lead to any potential conflicts?
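
To make those totals concrete, here is the same arithmetic with hypothetical values (N = 8 GPUs, S = 1 GiB), just to check my understanding:

```python
# Hypothetical values: N = 8 GPUs, S = 1 GiB per shard.
N = 8
S = 1  # GiB

ag_send, ag_recv = S, (N - 1) * S    # Allgather: 1 GiB out, 7 GiB in
rs_send, rs_recv = (N - 1) * S, S    # Reduce-Scatter: 7 GiB out, 1 GiB in

# Overlapped, each GPU moves N*S = 8 GiB in each direction.
print(ag_send + rs_send, ag_recv + rs_recv)  # -> 8 8
```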

I appreciate your feedback!

The answer is: it depends.

It depends on how many GPUs and how many nodes you have, the IB switch type, the link topology, how many IB HCAs each node has, how the HCAs are mapped to the GPUs, and how your code is designed.

For example:

If you use an NDR 64-port switch with a rail-to-rail link topology and a 1:1 GPU:HCA mapping, then each GPU rank can create one broadcast communicator (SAT), and the code can reuse it for any collective (allreduce, allgather, and so on).
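
As a rough sketch of that setup (my assumptions: PyTorch, and NCCL_COLLNET_ENABLE as the variable the SHARP documentation refers to; note that PyTorch creates the underlying NCCL communicator lazily, so check the deployment guide for the exact timing):

```python
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")
ranks = list(range(dist.get_world_size()))

# Assumption: NCCL_COLLNET_ENABLE is the SHARP enable knob the docs mean;
# per the docs it must be set before the process group is created.
os.environ["NCCL_COLLNET_ENABLE"] = "1"
sharp_pg = dist.new_group(ranks)   # this PG gets the SAT resources

os.environ["NCCL_COLLNET_ENABLE"] = "0"
plain_pg = dist.new_group(ranks)   # later PGs fall back to normal NCCL

# Reuse sharp_pg for every SHARP-eligible collective (allreduce,
# reduce-scatter); run everything else on plain_pg.
```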

If you have such a design requirement, I suggest contacting the NVIDIA Solution Engineering team to have them help review the design.
