Does NCCL 2.4 use hierarchical all-reduce for multi-node, multi-GPU all-reduce by default?
Hierarchical All-Reduce consists of three stages:
- Intra-node reduce
- Inter-node all-reduce
- Intra-node broadcast
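
To be clear about what I mean, here is a minimal pure-Python sketch of those three stages. It only simulates the data flow with scalar sums and does not call NCCL or reflect how NCCL implements anything; the function name `hierarchical_all_reduce` and the nested-list layout (`nodes[i][j]` is the value on GPU `j` of node `i`) are made up for illustration.

```python
from typing import List


def hierarchical_all_reduce(nodes: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over nodes[i][j] (GPU j on node i).

    Hypothetical helper for illustration only; not an NCCL API.
    """
    # Stage 1: intra-node reduce.
    # Each node sums its GPUs' values onto one local root GPU.
    node_sums = [sum(gpus) for gpus in nodes]

    # Stage 2: inter-node all-reduce.
    # The local roots combine their partial sums so each root holds the total.
    total = sum(node_sums)

    # Stage 3: intra-node broadcast.
    # Each root sends the total to every GPU on its node.
    return [[total for _ in gpus] for gpus in nodes]


if __name__ == "__main__":
    # Two nodes with two GPUs each.
    print(hierarchical_all_reduce([[1.0, 2.0], [3.0, 4.0]]))
```

With this scheme only one value per node crosses the network in stage 2, which is the property I am asking about.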
Thanks a lot!