Hi,
I am encountering a SHARP-related error when testing gromacs/21.3.
System: AMD EPYC 7543 and 8 x A100-SXM-80GB
OS: CentOS 7.9.2009 with 3.10.0-1160 kernel
Env: gcc/10.2, cuda/11.4, OpenMPI 4.1.1 built against both UCX and HCOLL
$ ompi_info
Configure command line:
'--with-ucx=/apps/common/ucx/1.11.2'
'--with-ucx-libdir=/apps/common/ucx/1.11.2/lib'
'--with-hcoll=/opt/mellanox/hcoll'
[Problem description]
- The error appears only when running GROMACS on more than 3 nodes. Nevertheless, the calculations proceed to completion without crashing.
# [gpu30:0:19792 - context.c:763] INFO sharp_job_id:469 tree_type:LLT tree_idx:0 treeID:0x0 caps:0x6 quota:(osts:167 user_data_per_ost:1024 max_groups:167 max_qps:64 max_group_channels:1)
# [gpu30:0:19792 - context.c:767] INFO sharp_job_id:469 tree_type:SAT tree_idx:1 treeID:0x3f caps:0x16
# [gpu30:0:19792 unique id 47] ERROR AN MAD error in sharp_connect_tree.
# [gpu30:0:19792 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu32:128:61292 unique id 12] ERROR AN MAD error in sharp_connect_tree.
# [gpu32:128:61292 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu31:64:5267 unique id 10] ERROR AN MAD error in sharp_connect_tree.
# [gpu31:64:5267 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
- mpirun options used during the tests
mpirun \
    --map-by numa:PE=8 \
    --mca pml ucx \
    --mca coll_hcoll_enable 1 \
    -x OMP_NUM_THREADS=8 \
    -x HCOLL_ENABLE_SHARP=3 \
    -x SHARP_COLL_ENABLE_SAT=1 \
    -x SHARP_COLL_LOG_LEVEL=3 \
    -x UCX_TLS=dc,sm,self \
    gmx_mpi ...
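To isolate SHARP as the trigger, a variant of the same launch line with SHARP disabled could serve as an A/B baseline. This is only a sketch: HCOLL_ENABLE_SHARP=0 is, as far as I know, the standard knob for turning SHARP off in HCOLL, and the gmx_mpi arguments are elided as above.

```shell
# Baseline run: identical launch, but with SHARP disabled in HCOLL.
# Assumption: HCOLL_ENABLE_SHARP=0 disables SHARP; gmx_mpi args elided.
mpirun \
    --map-by numa:PE=8 \
    --mca pml ucx \
    --mca coll_hcoll_enable 1 \
    -x OMP_NUM_THREADS=8 \
    -x HCOLL_ENABLE_SHARP=0 \
    -x UCX_TLS=dc,sm,self \
    gmx_mpi ...
```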
- sharpd status (on gpu30)
sharpd.service - SHARP Daemon (sharpd). Version: 2.5.1.MLNX20210812.e3c2616
Loaded: loaded (/etc/systemd/system/sharpd.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/sharpd.service.d
└─Service.conf
Active: active (running) since Fri 2022-03-25 13:37:01 KST; 2 days ago
Main PID: 3930 (sharpd)
Tasks: 6
Memory: 61.3M
CGroup: /system.slice/sharpd.service
└─3930 /opt/mellanox/sharp/bin/sharpd -P -O -/etc/sharp/sharpd.cfg
- sharpd log (on gpu30)
[Mar 28 09:35:40 863847][SD][3934][info ] - waiting for QPAlloc alloc QP response from AN
[Mar 28 09:35:46 877923][SD][3934][info ] - QPAlloc alloc QP response status: 0x6e, mad status: 0x0
[Mar 28 09:35:46 877969][SD][3934][error] - recv AM QPAlloc alloc QP MAD failed 0x6e
[Mar 28 09:35:46 877995][SD][3934][info ] - connect tree QPN slot 0 QPN 0x0
[Mar 28 09:35:46 878001][SD][3934][info ] - connect tree job ID 2884894721 tree ID 63 local QPN 0x11299 AN QPN 0x0 status 18
[Mar 28 09:35:46 878082][SD][3930][info ] - read 8 message length 32 read count 1 opcode 0x90 TID 0x21
[Mar 28 09:35:46 900397][SD][3930][info ] - receiving from client 4
[Mar 28 09:35:46 900417][SD][3930][info ] - client 4 read 40 message length 64 read count 1 opcode 0xc TID 0x22
[Mar 28 09:35:46 900421][SD][3930][info ] - SHARPD_OP_LEAVE_GROUP TID 0x22
[Mar 28 09:35:46 900441][SD][3934][info ] - leave group ID 0x21 tree ID 0 AN QPN 0x82d911
[Mar 28 09:35:46 900456][SD][3934][info ] - leave AN LID 21 group ID 0x21 PKey 0xffff MTU 4 rate 16 SL 0 PLL 18 from tree ID 0 PathRecord
[Mar 28 09:35:46 900463][SD][3934][info ] - AN GroupJoin leave request MAD TID 0x30b
- Other observation:
OSU benchmarks did not produce the above error.
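If I understand the Mellanox packaging correctly, the SHARP installation also ships a small per-node self-test that exercises SHARP setup outside of MPI; a sketch of how it might be invoked on each node (the device name is an example, not necessarily our HCA):

```shell
# Per-node SHARP self-test (assumption: sharp_hello ships under the
# /opt/mellanox/sharp prefix shown above; mlx5_0:1 is an example device).
/opt/mellanox/sharp/bin/sharp_hello -d mlx5_0:1
```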
The messages in the sharpd log are too cryptic for me to interpret. Any suggestions for further diagnosing the issue would be much appreciated.
Regards.