Consider a communicator of size p in which each participant has unidirectional bandwidth B, and the message size is S.

Ignoring both latency and computation, the theoretical time for gather, scatter, all_gather, and reduce_scatter is (p-1)/p * (S/B). For all_reduce it should be 2 * (p-1)/p * (S/B) if reduction in network is disabled.
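To make the formulas above concrete, here is a minimal sketch that evaluates them. The function name `collective_time` and the operation strings are my own illustration, not part of any library, and the model is bandwidth-only (no latency, no computation, no reduction in network).

```python
def collective_time(op, p, S, B):
    """Bandwidth-only time estimate without reduction in network.

    p: communicator size, S: message size in bytes,
    B: unidirectional bandwidth per participant in bytes/second.
    """
    if p < 1 or B <= 0:
        raise ValueError("p must be >= 1 and B must be > 0")
    # Each participant moves (p-1)/p of the message in one pass.
    one_pass = (p - 1) / p * (S / B)
    if op in ("gather", "scatter", "all_gather", "reduce_scatter"):
        return one_pass
    if op == "all_reduce":
        # Modeled as reduce_scatter followed by all_gather: two passes.
        return 2 * one_pass
    raise ValueError(f"unknown collective: {op}")


if __name__ == "__main__":
    # Example values (assumed): 8 ranks, 1 GB message, 10 GB/s links.
    for name in ("all_gather", "reduce_scatter", "all_reduce"):
        print(name, collective_time(name, p=8, S=1e9, B=1e10))
```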

However, if reduction in network is enabled, what is the theoretical time for the above collectives?

I think that for all_reduce it should be S/B: each participant only needs to send S and receive S, and the send and receive can happen simultaneously (i.e. pipelining). This would also mean that reduction in network effectively increases the bandwidth B by a factor of 2(p-1)/p, which approaches 2X for large p.
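As a quick check of that estimate, the sketch below compares the two all_reduce times. It assumes my S/B guess for the in-network case is right, which is exactly the part I am unsure about, and the function names are hypothetical.

```python
def all_reduce_time_no_in_network(p, S, B):
    # Two passes of (p-1)/p * S/B, with reduction done at the participants.
    return 2 * (p - 1) / p * (S / B)


def all_reduce_time_in_network(S, B):
    # Assumption: each participant sends S and receives S concurrently.
    return S / B


def all_reduce_speedup(p):
    # Ratio of the two estimates; S and B cancel out.
    return 2 * (p - 1) / p


if __name__ == "__main__":
    for p in (2, 4, 8, 64, 1024):
        print(p, all_reduce_speedup(p))
```

Under these assumptions the ratio is 2(p-1)/p, so there is no gain at p = 2 and the gain only approaches 2X as p grows.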

But how do I calculate the theoretical time for the other collectives, and how do I evaluate the benefit of reduction in network for them?