Why didn't Mellanox SHARP improve performance when running tests with OpenMPI and a deep learning distributed training benchmark?

Following the instructions in the Mellanox documentation, I successfully installed MOFED and Mellanox SHARP on 4 GPU servers. However, when I used OpenMPI to run osu_allreduce with and without Mellanox SHARP, the performance was the same. The snippets below show my commands:

Without SHARP:

$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root --bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x HCOLL_MAIN_IB=mlx5_0:1 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728

# OSU MPI Allreduce Latency Test v5.6.2
# Size        Avg Latency(us)
1048576               1441.46
2097152               2478.80
4194304               4136.40
8388608               9056.91
16777216             19946.72
33554432             46106.70
67108864             92746.40
134217728           185052.29

With SHARP:

$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root -bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x HCOLL_MAIN_IB=mlx5_0:1 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_LOG_LEVEL=3 -x HCOLL_ENABLE_SHARP=3 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728

# OSU MPI Allreduce Latency Test v5.6.2
# Size        Avg Latency(us)
1048576               1339.99
2097152               2293.31
4194304               4241.36
8388608               9110.42
16777216             19284.43
33554432             45360.29
67108864             90132.80
134217728           179329.17

Moreover, when I run distributed training across the 4 GPUs on the ResNet50 benchmark with and without SHARP, the same thing happens:

Without SHARP:

$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py

[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 242.9 ±1.7
[1,0]:Total img/sec on 4 GPU(s): 971.7 ±6.7

With SHARP:

$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x ENABLE_SHARP_COLL=3 -x SHARP_COLL_ENABLE_SAT=3 -x NCCL_COLLNET_ENABLE=3 -x NCCL_ALGO=CollNet -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py

[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 237.0 ±21.2
[1,0]:Total img/sec on 4 GPU(s): 948.1 ±84.8

Can you guys help me?

Best regards,

Tuan

Hi Tuan,

First of all, please install HPC-X 2.7 instead of 2.8 (we have a known issue with that version):

https://www.mellanox.com/products/hpc-x-toolkit
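After extracting the 2.7 package, load its environment before rerunning the tests. A minimal sketch, assuming a fresh extract (the path below is illustrative; hpcx-init.sh and the hpcx_load shell function it defines ship with the toolkit):

$ cd <path-to-extracted-hpcx-2.7>
$ source hpcx-init.sh
$ hpcx_load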

Since you are looking for SAT (SHARP for high bandwidth), you need to add specific flags on the command line; a combined example follows the list below.

1. Add -x HCOLL_ALLREDUCE_HYBRID_LB=1 -x HCOLL_SHARP_NP=0 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x SHARP_COLL_JOB_QUOTA_OSTS=32 -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=4 -x SHARP_COLL_LOG_LEVEL=3 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4100

2. Change -x HCOLL_ENABLE_SHARP=3 to -x HCOLL_ENABLE_SHARP=4

3. Remove -mca btl_tcp_if_include ib0

4. Remove -mca pml ob1
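Putting items 1-4 together with your original command, the SHARP run would look roughly like this (a sketch only: the benchmark path is illustrative and should point at your HPC-X 2.7 install, and I dropped the NCCL variables since osu_allreduce does not use NCCL):

$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root -bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_SHARP=4 -x HCOLL_ALLREDUCE_HYBRID_LB=1 -x HCOLL_SHARP_NP=0 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4100 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=1024 -x SHARP_COLL_JOB_QUOTA_OSTS=32 -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=4 -x SHARP_COLL_LOG_LEVEL=3 <path-to-hpcx-2.7>/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728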

Thanks,

Samer