[SHARP] Aggregation Manager Fails: Local Port validation failed

I’m encountering an issue with the SHARP Aggregation Manager on my system, and I’m not sure how to resolve it. sharp_am exits with the following error:

Sep 04 11:42:31 snail01 sharp_am[720665]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.

  • OS: Ubuntu 20.04.6 LTS
  • HPC-X: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64

[Problem description]

Before I run sharp_hello (the SHARP test program), the SHARP Aggregation Manager (sharp_am) appears to start normally:

tateiwa@snail01:~$ sudo systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
     Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/sharp_am.service.d
             └─Service.conf
     Active: active (running) since Wed 2024-09-04 11:42:09 JST; 2s ago
   Main PID: 720665 (sharp_am)
      Tasks: 68 (limit: 309169)
     Memory: 21.8M
     CGroup: /system.slice/sharp_am.service
             └─720665 /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg

However, after I run sharp_hello, it fails, and the sharp_am status changes to:

tateiwa@snail01:~$ sudo service sharp_am status
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
     Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/sharp_am.service.d
             └─Service.conf
     Active: failed (Result: exit-code) since Wed 2024-09-04 16:35:46 JST; 23min ago
    Process: 1406820 ExecStart=/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O $CONF (code=exited, status=255/EXCEPTION)
   Main PID: 1406820 (code=exited, status=255/EXCEPTION)

Sep 04 16:35:22 snail01 sharp_am[1406820]: Sharp AM pid: 1406820
Sep 04 16:35:22 snail01 sharp_am[1406820]: Command line: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg
Sep 04 16:35:24 snail01 sharp_am[1406820]: Built 1 trees.
Sep 04 16:35:44 snail01 sharp_am[1406820]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.
Sep 04 16:35:44 snail01 sharp_am[1406820]: signal 15 received from pid: 1406820, process: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Received a graceful termination signal - Stopping sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Shutting down SHARP Aggregation Manager
Sep 04 16:35:46 snail01 sharp_am[1406820]: sharp_am exit. Return code: -1
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Main process exited, code=exited, status=255/EXCEPTION
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Failed with result 'exit-code'.

Here is the sharp_hello execution log:

tateiwa@snail01:~$ $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_1:1 -v 3
[snail01:0:1767129 - context.c:670][2024-09-04 16:57:57] INFO job (ID: 9370541664355057156) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01][Sep 04 16:57:58 828008][SR     ][1767129][error] - no AM service record found(SA query)
[snail01][Sep 04 16:57:58 862901][RDMA_SR][1767129][error] - Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR,  error: -22
[snail01][Sep 04 16:57:58 862963][RDMA_SR][1767129][error] - Error occured during connection event handle
[snail01][Sep 04 16:58:01 866031][RDMA_SR][1767129][error] - poll failed due to poll_timeout=3000.000000, stop
[snail01][Sep 04 16:58:01 866125][RDMA_SR][1767129][error] - Poll failed
[snail01][Sep 04 16:58:01 866158][RDMA_SR][1767129][error] - Failed to connect
[snail01][Sep 04 16:58:01 866554][RDMA_SR][1767129][error] - rdma_resolve_addr failed with error: -1
[snail01][Sep 04 16:58:01 866608][RDMA_SR][1767129][error] - rdma_resolve_addr failed with error: -1
[snail01][Sep 04 16:58:01 866637][SR     ][1767129][error] - unable to query AM service record(AM query)
[snail01][Sep 04 16:58:01 866657][GENERAL][1767129][error] - Could not query AM address, error: -52
[snail01][Sep 04 16:58:01 866682][GENERAL][1767129][error] - failed to connect to AM - error -1 received
[snail01][Sep 04 16:58:01 873171][GENERAL][1767129][error] - unable to connect to AM
[snail01][Sep 04 16:58:01 873207][GENERAL][1767129][warn ] - SHARPD_OP_CREATE_JOB failed with status: 53
[snail01:0:1767129 unique id 9370541664355057156][2024-09-04 16:58:01] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.

Does anyone have any insights on how to resolve this local port validation error? Any suggestions would be greatly appreciated!

Thank you in advance.

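Could you check whether you are starting sharp_am on the node that runs the master SM? The error says sharp_am must bind to the port with the master SM, but the log reports SM LID 35 and local LID 2.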
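
As a quick way to read that error, here is a minimal Python sketch that pulls the two LIDs out of the sharp_am message and compares them. The regular expression is based only on the log line quoted in this thread, so the message format may differ in other SHARP versions. The helper names (parse_lids, bound_to_master_sm) are made up for illustration. Treating "LIDs are equal" as "bound to the master SM port" is an assumption drawn from the wording of the error, not from the SHARP documentation.

```python
import re

# Matches the "(SM LID:35 local LID:2" fragment of the sharp_am error.
# The format is inferred from the log in this thread (an assumption).
LID_PATTERN = re.compile(r"SM LID:\s*(\d+)\s+local LID:\s*(\d+)")


def parse_lids(log_line):
    """Return (sm_lid, local_lid) as ints, or None if the line has no LIDs."""
    match = LID_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def bound_to_master_sm(log_line):
    """Return True/False if the LIDs match or differ, None if not parseable."""
    lids = parse_lids(log_line)
    if lids is None:
        return None
    sm_lid, local_lid = lids
    return sm_lid == local_lid


if __name__ == "__main__":
    line = (
        "Local Port validation failed. error: FabricProvider must bind to "
        "port with master SM (SM LID:35 local LID:2. Exiting."
    )
    print(parse_lids(line))
    print(bound_to_master_sm(line))
```

For the log above this reports SM LID 35 and local LID 2, so the port sharp_am bound to is not the master SM port. On the hosts themselves, the sminfo and ibstat tools from infiniband-diags should show the master SM LID and each local port's base LID, which can be compared in the same way.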

Thanks,