nccl-tests poor performance

We are running nccl-tests with MPI over RoCE and see poor performance when using all 4 GPUs in each of the 2 hosts.

Environment

  • OS: Red Hat Enterprise Linux release 8.10
  • 2 servers, each with 4x H100 SXM5 and a ConnectX-7 with 2x200GbE ports in an LACP bond
  • MPICH 4.2
  • CUDA 12.2
  • NCCL: 2.23.4
  • Nvidia Driver Version: 555.42.06
  • nccl-tests: 9d26b8422ba76c098df996b96e13b8ddf3a71165

Issues

  • GPUDirect RDMA: nvidia-peermem not found
    We get this issue even if we uninstall the NVIDIA GPU driver and reinstall it after the Mellanox NIC drivers (see the check sketched below, after the PCIe tree). We suspect GPUDirect RDMA may help with the poor nccl-tests performance and plan to test this by adding an additional NIC per GPU in the future.

  • Poor performance running nccl-tests across the 2 hosts with all 4 GPUs per host, specifically when GPU0 is included; on both servers GPU0 sits on the same PCIe switch as the NIC, as the PCIe topology excerpt below shows. We suspect some kind of PCIe contention, although according to Intel PCM the PCIe bus does not seem to be hitting the bandwidth limit of PCIe Gen 5 x16 (~63 GB/s).

 +-[0000:48]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
 |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
 |           +-00.2  Intel Corporation Ice Lake RAS
 |           +-00.4  Intel Corporation Device 0b23
 |           \-01.0-[49-4e]----00.0-[4a-4e]--+-00.0-[4b]--+-00.0  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           |            \-00.1  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           +-01.0-[4c]----00.0  NVIDIA Corporation GH100 [H100 SXM5 80GB]
 |                                           +-02.0-[4d]--
 |                                           \-1f.0-[4e]----00.0  Broadcom / LSI PCIe Switch management endpoint
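
A rough way to verify both suspicions on each host is sketched below; this is only a sketch (module and package names can differ between driver packagings), not a definitive procedure:

# Check whether the GPUDirect RDMA peer-memory module is loaded, and try loading it
lsmod | grep -i peermem
sudo modprobe nvidia-peermem

# Show the GPU/NIC topology as NVIDIA tools see it; GPU0 and the ConnectX-7
# ports should report a single-switch (PIX/PXB-style) relationship here
nvidia-smi topo -m

# Enable NCCL debug logging before re-running nccl-tests to see which NICs and
# transport NCCL selects, and whether GPUDirect RDMA is actually used
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET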

nccl-tests, 2 hosts, 4 GPUs per host, 8 GPUs total
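
For reference, a launch matching the parameters below would look roughly like this. It is a sketch rather than the exact command: the test binary name (broadcast_perf, inferred from the redop/root/busbw columns) and the build path are assumptions, while the host names and the -b/-e/-f/-g/-w/-n/-c values mirror the parameters echoed in the output header.

# Hypothetical 8-rank launch, 4 ranks per host, one GPU per rank (MPICH/Hydra syntax)
mpiexec -n 8 -ppn 4 -hosts host23,host24 \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1 -w 5 -n 20 -c 1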

# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3362857 on    host23 device  0 [0x4c] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 3488077 on    host24 device  0 [0x4c] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 3362858 on    host23 device  1 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 3488078 on    host24 device  1 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 3362859 on    host23 device  2 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 3488079 on    host24 device  2 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid 3362860 on    host23 device  3 [0xdc] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid 3488080 on    host24 device  3 [0xdc] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float    none       0   146252    7.34    7.34      0   146298    7.34    7.34      0
  2147483648     536870912     float    none       0   295494    7.27    7.27      0   299589    7.17    7.17      0
  4294967296    1073741824     float    none       0   592360    7.25    7.25      0   592323    7.25    7.25      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.26972
#

nccl-tests, 2 hosts, 3 GPUs per host, 6 GPUs total

  • GPU0 disabled by setting CUDA_VISIBLE_DEVICES=1,2,3 for every rank
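
A matching 6-rank launch would look roughly like this (same assumptions as the sketch above; -genv is the MPICH/Hydra option that exports an environment variable to every rank):

# Hypothetical 6-rank launch with GPU0 hidden on both hosts
mpiexec -n 6 -ppn 3 -hosts host23,host24 \
    -genv CUDA_VISIBLE_DEVICES 1,2,3 \
    ./build/broadcast_perf -b 1G -e 4G -f 2 -g 1 -w 5 -n 20 -c 1
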
# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3368208 on    host23 device  0 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 3494017 on    host24 device  0 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 3368209 on    host23 device  1 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 3494018 on    host24 device  1 [0xcc] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 3368210 on    host23 device  2 [0xdc] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 3494019 on    host24 device  2 [0xdc] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float    none       0    27374   39.22   39.22      0    27373   39.23   39.23      0
  2147483648     536870912     float    none       0    54768   39.21   39.21      0    54766   39.21   39.21      0
  4294967296    1073741824     float    none       0   109558   39.20   39.20      0   109558   39.20   39.20      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 39.2131
#

What specific CPU is used in this system, and is this a dual-CPU-socket system with two H100s per CPU socket?

What is confusing to me is seeing an “Ice Lake” designation instead of “Sapphire Rapids”. Does Ice Lake even have PCIe Gen 5 support? I thought it only supports PCIe Gen 4.

It is a dual-socket system with Sapphire Rapids 8480+ CPUs: 2x H100s per CPU, PCIe Gen 5. I'm guessing the “Ice Lake” label just comes from shared/inherited design components.

OK, that makes sense then. In official Intel parlance: Intel® Xeon® Platinum 8480+ Processor

I wouldn’t know what to expect for this benchmark, but the NVIDIA folks should know.