Using nccl-tests with MPI over RoCE, we see poor performance when using all 4 GPUs in each of the 2 hosts.
Environment
- OS: Red Hat Enterprise Linux release 8.10
- 2 servers, each with 4x H100 SXM5 and a ConnectX-7 NIC (2x 200GbE in LACP)
- MPICH 4.2
- CUDA 12.2
- NCCL: 2.23.4
- NVIDIA driver: 555.42.06
- nccl-tests: 9d26b8422ba76c098df996b96e13b8ddf3a71165
Issues
- GPUDirect RDMA: nvidia-peermem not found. We get this even if we uninstall the NVIDIA GPU driver and reinstall it after the Mellanox NIC drivers. We suspect GPUDirect RDMA may help with the poor nccl-tests performance and plan to test by adding an additional NIC per GPU in the future. (A quick module check is sketched after the PCIe tree below.)
- Poor performance running nccl-tests across 2 hosts with all 4 GPUs per host, specifically when GPU0 is used on both servers; GPU0 sits on the same PCIe switch as the NIC. We suspected some kind of PCIe contention, although Intel PCM shows the PCIe bus is not hitting the PCIe Gen 5 x16 bandwidth limit of ~63 GB/s. (A debug run that dumps NCCL's view of this topology follows the tree.)
+-[0000:48]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
| +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
| +-00.2 Intel Corporation Ice Lake RAS
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[49-4e]----00.0-[4a-4e]--+-00.0-[4b]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-01.0-[4c]----00.0 NVIDIA Corporation GH100 [H100 SXM5 80GB]
| +-02.0-[4d]--
| \-1f.0-[4e]----00.0 Broadcom / LSI PCIe Switch management endpoint
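For the first issue, a minimal check of the module state (a sketch, assuming nvidia-peermem is packaged with the 555.x driver install, which is the stock layout):

# Is the GPUDirect RDMA module loaded?
lsmod | grep nvidia_peermem

# Try loading it by hand; failure here usually means the module was not
# built against the running kernel / installed driver
sudo modprobe nvidia-peermem
dmesg | tail

With the module loaded, an NCCL_DEBUG=INFO run prints GDRDMA on its NET transport lines when GPUDirect RDMA is actually in use.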
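To probe the PCIe-contention theory, NCCL can log the paths it chose and dump the topology it detected. A sketch under assumptions: MPICH's hydra launcher (-genv propagates environment variables to all ranks), nccl-tests built in ./build, and broadcast_perf as the binary (the redop=none / root=0 columns in the results below look like a broadcast test; substitute whatever was actually run). The -b/-e/-f flags mirror the minBytes/maxBytes/step values in the headers below.

mpirun -np 8 -hosts host23,host24 \
    -genv NCCL_DEBUG INFO \
    -genv NCCL_DEBUG_SUBSYS INIT,GRAPH,NET \
    -genv NCCL_TOPO_DUMP_FILE /tmp/nccl_topo.xml \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1

The GRAPH/NET output shows whether traffic from GPU1-3 is being routed through the same upstream port that GPU0 and the ConnectX-7 share.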
nccl-tests: 2 hosts, 4 GPUs per host, 8 GPUs total
# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3362857 on host23 device 0 [0x4c] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 3488077 on host24 device 0 [0x4c] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 3362858 on host23 device 1 [0x5d] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 3488078 on host24 device 1 [0x5d] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 3362859 on host23 device 2 [0xcc] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 3488079 on host24 device 2 [0xcc] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 3362860 on host23 device 3 [0xdc] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 3488080 on host24 device 3 [0xdc] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1073741824 268435456 float none 0 146252 7.34 7.34 0 146298 7.34 7.34 0
2147483648 536870912 float none 0 295494 7.27 7.27 0 299589 7.17 7.17 0
4294967296 1073741824 float none 0 592360 7.25 7.25 0 592323 7.25 7.25 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.26972
#
nccl-tests: 2 hosts, 3 GPUs per host, 6 GPUs total
- GPU0 excluded via CUDA_VISIBLE_DEVICES=1,2,3 (see the launch sketch after the results)
# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3368208 on host23 device 0 [0x5d] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 3494017 on host24 device 0 [0x5d] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 3368209 on host23 device 1 [0xcc] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 3494018 on host24 device 1 [0xcc] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 3368210 on host23 device 2 [0xdc] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 3494019 on host24 device 2 [0xdc] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1073741824 268435456 float none 0 27374 39.22 39.22 0 27373 39.23 39.23 0
2147483648 536870912 float none 0 54768 39.21 39.21 0 54766 39.21 39.21 0
4294967296 1073741824 float none 0 109558 39.20 39.20 0 109558 39.20 39.20 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 39.2131
#
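For completeness, hiding GPU0 has to happen on every rank on both hosts; with hydra that is -genv again (same assumptions as the sketch above):

# GPU0 hidden everywhere; the remaining GPUs renumber to 0,1,2, which is
# why this run reports devices 0x5d/0xcc/0xdc as devices 0-2
mpirun -np 6 -hosts host23,host24 \
    -genv CUDA_VISIBLE_DEVICES 1,2,3 \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1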