Hi.
I’m trying to trace a simple Fortran MPI test code with nsys. The code runs and terminates correctly without any instrumentation. When I try to profile it with the following command:
nsys profile -t mpi srun -n 8 ./mpi_ping
I get a series of PMIx errors reporting failed collectives. The Slurm output is below:
Loading hdf5/1.14.3--openmpi--4.1.6--nvhpc--24.3
Loading requirement: nvhpc/24.3 hpcx/2.18.1--binary openmpi/4.1.6--nvhpc--24.3
zlib/1.3--gcc--12.2.0
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
srun: error: lrdn2068: tasks 4-7: Exited with exit code 1
slurmstepd: error: mpi/pmix_v3: pmixp_p2p_send: lrdn2066 [0]: pmixp_utils.c:467: send failed, rc=2, exceeded the retry limit
slurmstepd: error: mpi/pmix_v3: _slurm_send: lrdn2066 [0]: pmixp_server.c:1583: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.10332196.0, size = 2733, hostlist:
(null)
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_reset_if_to: lrdn2066 [0]: pmixp_coll_ring.c:742: 0x1530b002c230: collective timeout seq=0
slurmstepd: error: mpi/pmix_v3: pmixp_coll_log: lrdn2066 [0]: pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:760: 0x1530b002c230: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:762: my peerid: 0:lrdn2066
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:769: neighbor id: next 1:lrdn2068, prev 1:lrdn2068
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c2a8, #0, in-use=0
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c2e0, #1, in-use=0
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c318, #2, in-use=1
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:790: seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:792: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:825: done contrib: -
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:827: wait contrib: lrdn2068
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:829: status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:833: buf (offset/size): 2632/7896
[lrdn2066.leonardo.local:2707346] pml_ucx.c:179 Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707346] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 4
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 2
[lrdn2066.leonardo.local:2707347] pml_ucx.c:179 Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707347] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 5
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 2
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707348] pml_ucx.c:179 Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707348] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 6
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 2
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707349] pml_ucx.c:179 Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707349] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 7
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 2
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707346] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707347] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707348] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707349] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
slurmstepd: error: *** STEP 10332196.0 ON lrdn2066 CANCELLED AT 2024-12-18T10:05:40 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 10332196 ON lrdn2066 CANCELLED AT 2024-12-18T10:05:40 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
Tracing only fails for Fortran applications that use collectives; C/C++ applications trace correctly.
I originally intended to trace an MPI + CUDA application, but since even a plain MPI test code fails, the problem seems to be related to the MPI installation rather than my application.
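For reference, the test is roughly the following (a minimal sketch of the kind of Fortran ping code I am describing, not my exact mpi_ping source; the rank pairing and the choice of MPI_Allreduce are just illustrative):

program mpi_ping
   use mpi
   implicit none
   integer :: ierr, rank, nranks, partner, tag
   integer :: status(MPI_STATUS_SIZE)
   double precision :: sendbuf, recvbuf, local, total

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   ! Point-to-point ping-pong between pairs of ranks (even <-> odd)
   tag = 0
   sendbuf = rank
   if (mod(rank, 2) == 0 .and. rank + 1 < nranks) then
      partner = rank + 1
      call MPI_Send(sendbuf, 1, MPI_DOUBLE_PRECISION, partner, tag, MPI_COMM_WORLD, ierr)
      call MPI_Recv(recvbuf, 1, MPI_DOUBLE_PRECISION, partner, tag, MPI_COMM_WORLD, status, ierr)
   else if (mod(rank, 2) == 1) then
      partner = rank - 1
      call MPI_Recv(recvbuf, 1, MPI_DOUBLE_PRECISION, partner, tag, MPI_COMM_WORLD, status, ierr)
      call MPI_Send(sendbuf, 1, MPI_DOUBLE_PRECISION, partner, tag, MPI_COMM_WORLD, ierr)
   end if

   ! A simple collective, since the failure only shows up for Fortran codes that use collectives
   local = rank
   call MPI_Allreduce(local, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'sum of ranks =', total
   call MPI_Finalize(ierr)
end program mpi_ping

It builds and runs cleanly on its own; the errors above only appear when it is launched under nsys.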
I am using an OpenMPI 4.1.6 installation built with NVHPC 24.3, provided as an environment module on the cluster.
Is there currently any known issue with nsys tracing of MPI applications when UCX is enabled?