[SHARP] error in sharp_connect_tree

vitduck · March 28, 2022, 12:54am

Hi,

I am encountering SHARP-related error when testing gromacs/21.3
System: AMD EPYC 7543 and 8 x A100-SXM-80GB
OS: CentOS 7.9.2009 with 3.10.0-1160 kernel
Env: gcc/10.2, cuda/11.4, openmpi 4.1.1 built against both UCX and HCOLL

$ ompi_info
Configure command line: 
    '--with-ucx=/apps/common/ucx/1.11.2'
    '--with-ucx-libdir=/apps/common/ucx/1.11.2/lib'
    '--with-hcoll=/opt/mellanox/hcoll'

[Problem description]

Error only appears when running gromacs on more than 3 nodes.

Nevertheless, calculations proceeded to the end without crashing.

# [gpu30:0:19792 - context.c:763] INFO sharp_job_id:469  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:167 user_data_per_ost:1024 max_groups:167 max_qps:64 max_group_channels:1)
# [gpu30:0:19792 - context.c:767] INFO sharp_job_id:469  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
# [gpu30:0:19792 unique id 47] ERROR AN MAD error in sharp_connect_tree.
# [gpu30:0:19792 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu32:128:61292 unique id 12] ERROR AN MAD error in sharp_connect_tree.
# [gpu32:128:61292 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu31:64:5267 unique id 10] ERROR AN MAD error in sharp_connect_tree.
# [gpu31:64:5267 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)

mpirun options used during tests

mpirun  
    --map-by numa:PE=8                         
    --mca pml ucx                                     
    --mca coll_hcoll_enable 1                
    -x OMP_NUM_THREADS=8              
    -x HCOLL_ENABLE_SHARP=3        
    -x SHARP_COLL_ENABLE_SAT=1  
    -x SHARP_COLL_LOG_LEVEL=3     
    -x UCX_TLS=dc,sm,self                   
    gmx_mpi                                           
    ...

sharpd status (on gpu30)

sharpd.service - SHARP Daemon (sharpd). Version: 2.5.1.MLNX20210812.e3c2616
   Loaded: loaded (/etc/systemd/system/sharpd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sharpd.service.d
           └─Service.conf
   Active: active (running) since Fri 2022-03-25 13:37:01 KST; 2 days ago
 Main PID: 3930 (sharpd)
    Tasks: 6
   Memory: 61.3M
   CGroup: /system.slice/sharpd.service
           └─3930 /opt/mellanox/sharp/bin/sharpd -P -O -/etc/sharp/sharpd.cfg

sharpd log (on gpu30)

[Mar 28 09:35:40 863847][SD][3934][info ] - waiting for QPAlloc alloc QP response from AN
[Mar 28 09:35:46 877923][SD][3934][info ] - QPAlloc alloc QP response status: 0x6e, mad status: 0x0
[Mar 28 09:35:46 877969][SD][3934][error] - recv AM QPAlloc alloc QP MAD failed 0x6e
[Mar 28 09:35:46 877995][SD][3934][info ] - connect tree QPN slot 0 QPN 0x0
[Mar 28 09:35:46 878001][SD][3934][info ] - connect tree job ID 2884894721 tree ID 63 local QPN 0x11299 AN QPN 0x0 status 18
[Mar 28 09:35:46 878082][SD][3930][info ] - read 8 message length 32 read count 1 opcode 0x90 TID 0x21
[Mar 28 09:35:46 900397][SD][3930][info ] - receiving from client 4
[Mar 28 09:35:46 900417][SD][3930][info ] - client 4 read 40 message length 64 read count 1 opcode 0xc TID 0x22
[Mar 28 09:35:46 900421][SD][3930][info ] - SHARPD_OP_LEAVE_GROUP TID 0x22
[Mar 28 09:35:46 900441][SD][3934][info ] - leave group ID 0x21 tree ID 0 AN QPN 0x82d911
[Mar 28 09:35:46 900456][SD][3934][info ] - leave AN LID 21 group ID 0x21 PKey 0xffff MTU 4 rate 16 SL 0 PLL 18 from tree ID 0 PathRecord
[Mar 28 09:35:46 900463][SD][3934][info ] - AN GroupJoin leave request MAD TID 0x30b

Other observation:
OSU benchmarks did not produce the above error.

The log message from sharp daemon is too cryptic. Suggestions to further diagnose the issue are much appreciated.

Regards.

devendar · May 10, 2022, 3:57pm

What is the MOFED version?
how many HCA per server? is there any binding of the process to HCA? Can you check if you can run with 1 process per
server?

vitduck · May 12, 2022, 12:53am

Hi,

MOFED version:
We are using MLNX_OFED_LINUX-5.4-3.1.0.0
HCA per server:
For HGX-A100, there are 10 HCA per server, i.e. mlx5_{0…9}
Process binding to HCA:
As I understand, the process binding to rail is automatically handled by UCX.
We do not use any specific binding options.
Test with one process per server:
In case of GROMACS, we are using full 64 cores per node, i.e a CPU:GPU ratio of 8:1.
When sharp_connect_tree error appears, the total number of processes is 64x4 = 256.
In any case, we will reduce the number of process per node per your suggestion and see if the problem persists.

Regards.

Topic		Replies	Views
[SHARP] Failed to initialize SHArP collectives Application Accelerator Software	2	2168	May 12, 2022
[SHARP] Error When Running nccl-tests with multi-GPUs per node using SHARP Network Management Products networking	5	402	October 13, 2024
Why Mellanox SHARP didn't improve the performance when running tests with OpenMPI and Deep Learning Distributed Training Benchmark? Software And Drivers	1	446	February 23, 2021
about running cuda on a gpu cluster CUDA Programming and Performance	25	21792	May 31, 2010
Internode nvshmme and ib problem GPU-Accelerated Libraries nvshmem	20	1644	April 24, 2024
Fail to launch CUDA-MPS CUDA Programming and Performance	9	8728	October 26, 2015
MPI running issue using NVIDIA MPS Service on Multi-GPU nodes CUDA Programming and Performance	4	2315	September 16, 2016
MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA RDMA Software For GPU problem , configurations	8	6940	March 28, 2020
Gpu error,How to solve this problem? Jetson Xavier NX kernel	0	1068	October 10, 2020
cuDSS MGMN Distributed node bounds has a wrong left endpoint GPU-Accelerated Libraries cudss	7	139	August 7, 2025

[SHARP] error in sharp_connect_tree

Related topics