I’m encountering an issue with the SHARP Aggregation Manager (sharp_am) on my system, and I’m not sure how to resolve it. The service exits with the following error:
Sep 04 11:42:31 snail01 sharp_am[720665]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.
- OS: Ubuntu 20.04.6 LTS
- HPC-X: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64
[Problem description]
Before I run sharp_hello, the SHARP test script, the SHARP Aggregation Manager (sharp_am) appears to start normally:
tateiwa@snail01:~$ sudo systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: active (running) since Wed 2024-09-04 11:42:09 JST; 2s ago
Main PID: 720665 (sharp_am)
Tasks: 68 (limit: 309169)
Memory: 21.8M
CGroup: /system.slice/sharp_am.service
└─720665 /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg
However, after I run sharp_hello, the test fails and the status of sharp_am changes to:
tateiwa@snail01:~$ sudo service sharp_am status
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: failed (Result: exit-code) since Wed 2024-09-04 16:35:46 JST; 23min ago
Process: 1406820 ExecStart=/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O $CONF (code=exited, status=255/EXCEPTION)
Main PID: 1406820 (code=exited, status=255/EXCEPTION)
Sep 04 16:35:22 snail01 sharp_am[1406820]: Sharp AM pid: 1406820
Sep 04 16:35:22 snail01 sharp_am[1406820]: Command line: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg
Sep 04 16:35:24 snail01 sharp_am[1406820]: Built 1 trees.
Sep 04 16:35:44 snail01 sharp_am[1406820]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.
Sep 04 16:35:44 snail01 sharp_am[1406820]: signal 15 received from pid: 1406820, process: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Received a graceful termination signal - Stopping sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Shutting down SHARP Aggregation Manager
Sep 04 16:35:46 snail01 sharp_am[1406820]: sharp_am exit. Return code: -1
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Main process exited, code=exited, status=255/EXCEPTION
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Failed with result 'exit-code'.
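For reference, the two LIDs in the validation message can be extracted and compared mechanically. The snippet below is only a sketch for reading the log line, not a fix: it assumes the message format shown in the journal output above. The commented commands at the end assume infiniband-diags is installed, and the device mlx5_1 and port 1 are borrowed from the sharp_hello command, which may not be the port sharp_am actually binds to.

```shell
#!/bin/sh
# Validation message copied verbatim from the sharp_am journal output.
msg='Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.'

# Pull out the LID of the master SM and the LID of the port sharp_am bound to.
sm_lid=$(printf '%s\n' "$msg" | sed -n 's/.*SM LID:\([0-9][0-9]*\).*/\1/p')
local_lid=$(printf '%s\n' "$msg" | sed -n 's/.*local LID:\([0-9][0-9]*\).*/\1/p')

# The validation fails when the two LIDs differ.
if [ "$sm_lid" = "$local_lid" ]; then
    verdict="local port hosts the master SM"
else
    verdict="master SM is on LID $sm_lid, but sharp_am bound to LID $local_lid"
fi
echo "$verdict"

# Possible follow-up checks on the fabric (assumptions: infiniband-diags
# installed, mlx5_1 port 1 is the port of interest):
#   sminfo -C mlx5_1 -P 1                  # LID of the current master SM
#   smpquery -C mlx5_1 -P 1 nodedesc 35    # node description of the port that owns LID 35
#   ibstat mlx5_1 1                        # "Base lid" and "SM lid" of the local port
```

For the message above this prints "master SM is on LID 35, but sharp_am bound to LID 2", which suggests the master subnet manager is on a different port or node than the one sharp_am is using.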
Here is the sharp_hello execution log:
tateiwa@snail01:~$ $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_1:1 -v 3
[snail01:0:1767129 - context.c:670][2024-09-04 16:57:57] INFO job (ID: 9370541664355057156) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01][Sep 04 16:57:58 828008][SR ][1767129][error] - no AM service record found(SA query)
[snail01][Sep 04 16:57:58 862901][RDMA_SR][1767129][error] - Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -22
[snail01][Sep 04 16:57:58 862963][RDMA_SR][1767129][error] - Error occured during connection event handle
[snail01][Sep 04 16:58:01 866031][RDMA_SR][1767129][error] - poll failed due to poll_timeout=3000.000000, stop
[snail01][Sep 04 16:58:01 866125][RDMA_SR][1767129][error] - Poll failed
[snail01][Sep 04 16:58:01 866158][RDMA_SR][1767129][error] - Failed to connect
[snail01][Sep 04 16:58:01 866554][RDMA_SR][1767129][error] - rdma_resolve_addr failed with error: -1
[snail01][Sep 04 16:58:01 866608][RDMA_SR][1767129][error] - rdma_resolve_addr failed with error: -1
[snail01][Sep 04 16:58:01 866637][SR ][1767129][error] - unable to query AM service record(AM query)
[snail01][Sep 04 16:58:01 866657][GENERAL][1767129][error] - Could not query AM address, error: -52
[snail01][Sep 04 16:58:01 866682][GENERAL][1767129][error] - failed to connect to AM - error -1 received
[snail01][Sep 04 16:58:01 873171][GENERAL][1767129][error] - unable to connect to AM
[snail01][Sep 04 16:58:01 873207][GENERAL][1767129][warn ] - SHARPD_OP_CREATE_JOB failed with status: 53
[snail01:0:1767129 unique id 9370541664355057156][2024-09-04 16:58:01] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.
Does anyone have any insights on how to resolve this local port validation error? Any suggestions would be greatly appreciated!
Thank you in advance.