Following the instructions in the Mellanox documentation, I have successfully installed MOFED and Mellanox SHARP on 4 GPU servers. However, when I use Open MPI to run osu_allreduce with and without Mellanox SHARP, the performance is essentially the same. My commands and results are below.
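(For reference, the MOFED build on each node can be confirmed with the standard version query below; I am omitting my own output here.)
$ ofed_info -s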
Without SHARP:
$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root --bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x HCOLL_MAIN_IB=mlx5_0:1 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728
# OSU MPI Allreduce Latency Test v5.6.2
Size Avg Latency(us)
…
1048576 1441.46
2097152 2478.80
4194304 4136.40
8388608 9056.91
16777216 19946.72
33554432 46106.70
67108864 92746.40
134217728 185052.29
With SHARP:
$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root -bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x HCOLL_MAIN_IB=mlx5_0:1 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_LOG_LEVEL=3 -x HCOLL_ENABLE_SHARP=3 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728
# OSU MPI Allreduce Latency Test v5.6.2
Size Avg Latency(us)
…
1048576 1339.99
2097152 2293.31
4194304 4241.36
8388608 9110.42
16777216 19284.43
33554432 45360.29
67108864 90132.80
134217728 179329.17
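Is there a recommended way to confirm that SHARP trees are actually being built for these runs? The only check I know of is the basic connectivity test that ships with the SHARP package (the path assumes the HPC-X 2.8.0 layout used in the commands above, and the exact flags may differ between SHARP versions):
$ $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3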
Moreover, when I run distributed training across 4 GPUs on the ResNet50 benchmark with and without SHARP, the same situation occurs:
Without SHARP:
$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py
[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 242.9 ±1.7
[1,0]:Total img/sec on 4 GPU(s): 971.7 ±6.7
With SHARP:
$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x ENABLE_SHARP_COLL=3 -x SHARP_COLL_ENABLE_SAT=3 -x NCCL_COLLNET_ENABLE=3 -x NCCL_ALGO=CollNet -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py
[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 237.0 ±21.2
[1,0]:Total img/sec on 4 GPU(s): 948.1 ±84.8
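For the Horovod run I can repeat the benchmark with NCCL debug output enabled and filter rank 0's log for CollNet/SHARP lines, roughly like this (sketch only, same mpirun options as above; I have not included that output here):
$ mpirun <same options as above> -x NCCL_DEBUG=INFO ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py 2>&1 | grep -iE 'collnet|sharp'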
Can you guys help me figure out why enabling SHARP makes no measurable difference?
Best regards,
Tuan