nccl-tests poor performance

We are running nccl-tests with MPI over RoCE and see poor performance when using all 4 GPUs in each of the 2 hosts.

Environment

  • OS: Red Hat Enterprise Linux release 8.10
  • 2 servers, each with 4x H100 SXM5 and a ConnectX-7 with 2x200GbE ports in an LACP bond
  • MPICH 4.2
  • CUDA 12.2
  • NCCL: 2.23.4
  • Nvidia Driver Version: 555.42.06
  • nccl-tests: 9d26b8422ba76c098df996b96e13b8ddf3a71165

Issues

  • GPUDirect RDMA: nvidia-peermem not found
    We get this issue even if we uninstall the NVIDIA GPU driver and reinstall it after the Mellanox NIC drivers (see the check sketched below, after the PCIe tree). We suspect GPUDirect RDMA may help with the poor nccl-tests performance and plan to test this by adding an additional NIC per GPU in the future.

  • Poor performance running nccl-tests across the 2 hosts with all 4 GPUs per host, specifically when GPU0 is included; on both servers GPU0 sits on the same PCIe switch as the NIC, as the PCIe topology excerpt below shows. We suspect some kind of PCIe contention, although according to Intel PCM the PCIe bus does not seem to be hitting the bandwidth limit of PCIe Gen 5 x16 (~63 GB/s).

 +-[0000:48]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
 |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
 |           +-00.2  Intel Corporation Ice Lake RAS
 |           +-00.4  Intel Corporation Device 0b23
 |           \-01.0-[49-4e]----00.0-[4a-4e]--+-00.0-[4b]--+-00.0  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           |            \-00.1  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           +-01.0-[4c]----00.0  NVIDIA Corporation GH100 [H100 SXM5 80GB]
 |                                           +-02.0-[4d]--
 |                                           \-1f.0-[4e]----00.0  Broadcom / LSI PCIe Switch management endpoint
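
A rough way to verify both suspicions on each host is sketched below; this is only a sketch (module and package names can differ between driver packagings), not a definitive procedure:

# Check whether the GPUDirect RDMA peer-memory module is loaded, and try loading it
lsmod | grep -i peermem
sudo modprobe nvidia-peermem

# Show the GPU/NIC topology as NVIDIA tools see it; GPU0 and the ConnectX-7
# ports should report a single-switch (PIX/PXB-style) relationship here
nvidia-smi topo -m

# Enable NCCL debug logging before re-running nccl-tests to see which NICs and
# transport NCCL selects, and whether GPUDirect RDMA is actually used
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET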

nccl-tests, 2 hosts, 4 GPUs per host, 8 GPUs total
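
For reference, a launch matching the parameters below would look roughly like this. It is a sketch rather than the exact command: the test binary name (broadcast_perf, inferred from the redop/root/busbw columns) and the build path are assumptions, while the host names and the -b/-e/-f/-g/-w/-n/-c values mirror the parameters echoed in the output header.

# Hypothetical 8-rank launch, 4 ranks per host, one GPU per rank (MPICH/Hydra syntax)
mpiexec -n 8 -ppn 4 -hosts host23,host24 \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1 -w 5 -n 20 -c 1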

# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3362857 on    host23 device  0 [0x4c] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 3488077 on    host24 device  0 [0x4c] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 3362858 on    host23 device  1 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 3488078 on    host24 device  1 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 3362859 on    host23 device  2 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 3488079 on    host24 device  2 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid 3362860 on    host23 device  3 [0xdc] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid 3488080 on    host24 device  3 [0xdc] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float    none       0   146252    7.34    7.34      0   146298    7.34    7.34      0
  2147483648     536870912     float    none       0   295494    7.27    7.27      0   299589    7.17    7.17      0
  4294967296    1073741824     float    none       0   592360    7.25    7.25      0   592323    7.25    7.25      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.26972
#

nccl-tests, 2 hosts, 3 GPUs per host, 6 GPUs total

  • GPU0 disabled by setting CUDA_VISIBLE_DEVICES=1,2,3 for every rank
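
A matching 6-rank launch would look roughly like this (same assumptions as the sketch above; -genv is the MPICH/Hydra option that exports an environment variable to every rank):

# Hypothetical 6-rank launch with GPU0 hidden on both hosts
mpiexec -n 6 -ppn 3 -hosts host23,host24 \
    -genv CUDA_VISIBLE_DEVICES 1,2,3 \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1 -w 5 -n 20 -c 1
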
# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3368208 on    host23 device  0 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 3494017 on    host24 device  0 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 3368209 on    host23 device  1 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 3494018 on    host24 device  1 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 3368210 on    host23 device  2 [0xdc] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 3494019 on    host24 device  2 [0xdc] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float    none       0    27374   39.22   39.22      0    27373   39.23   39.23      0
  2147483648     536870912     float    none       0    54768   39.21   39.21      0    54766   39.21   39.21      0
  4294967296    1073741824     float    none       0   109558   39.20   39.20      0   109558   39.20   39.20      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 39.2131
#

What specific CPU is used in this system, and is this a dual-CPU-socket system with two H100s per CPU socket?

What is confusing to me is seeing an “Ice Lake” designation instead of “Sapphire Rapids”. Does Ice Lake even have PCIe Gen 5 support? I thought it only supports PCIe Gen 4.

It is a dual-socket system with Sapphire Rapids 8480+ CPUs: 2x H100s per CPU, PCIe Gen 5. I'm guessing the “Ice Lake” label just comes from shared/inherited design components.

OK, that makes sense then. In official Intel parlance: Intel® Xeon® Platinum 8480+ Processor

I wouldn’t know what to expect for this benchmark, but the NVIDIA folks should know.