Unable to trace Fortran MPI codes with collectives

Hi.

I’m trying to trace a simple Fortran MPI test code with nsys. The code runs and terminates correctly without any instrumentation. When I try to profile it with the following command:

nsys profile -t mpi srun -n 8 ./mpi_ping

I get a lot of PMIx errors reporting failed collectives. The Slurm output is below:

Loading hdf5/1.14.3--openmpi--4.1.6--nvhpc--24.3
  Loading requirement: nvhpc/24.3 hpcx/2.18.1--binary openmpi/4.1.6--nvhpc--24.3
    zlib/1.3--gcc--12.2.0
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
slurmstepd: error: TaskProlog failed status=1
srun: error: lrdn2068: tasks 4-7: Exited with exit code 1
slurmstepd: error:  mpi/pmix_v3: pmixp_p2p_send: lrdn2066 [0]: pmixp_utils.c:467: send failed, rc=2, exceeded the retry limit
slurmstepd: error:  mpi/pmix_v3: _slurm_send: lrdn2066 [0]: pmixp_server.c:1583: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.10332196.0, size = 2733, hostlist:
(null)
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_reset_if_to: lrdn2066 [0]: pmixp_coll_ring.c:742: 0x1530b002c230: collective timeout seq=0
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_log: lrdn2066 [0]: pmixp_coll.c:286: Dumping collective state
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:760: 0x1530b002c230: COLL_FENCE_RING state seq=0
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:762: my peerid: 0:lrdn2066
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:769: neighbor id: next 1:lrdn2068, prev 1:lrdn2068
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c2a8, #0, in-use=0
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c2e0, #1, in-use=0
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:779: Context ptr=0x1530b002c318, #2, in-use=1
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:790:       seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:792:       neighbor contribs [2]:
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:825:               done contrib: -
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:827:               wait contrib: lrdn2068
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:829:       status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: lrdn2066 [0]: pmixp_coll_ring.c:833:       buf (offset/size): 2632/7896
[lrdn2066.leonardo.local:2707346] pml_ucx.c:179  Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707346] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 4
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 2

[lrdn2066.leonardo.local:2707347] pml_ucx.c:179  Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707347] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 5
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 2

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707348] pml_ucx.c:179  Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707348] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 6
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 2

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707349] pml_ucx.c:179  Error: Failed to receive UCX worker address: Not found (-13)
[lrdn2066.leonardo.local:2707349] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 7
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 2

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[lrdn2066.leonardo.local:2707346] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707347] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707348] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[lrdn2066.leonardo.local:2707349] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
slurmstepd: error: *** STEP 10332196.0 ON lrdn2066 CANCELLED AT 2024-12-18T10:05:40 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 10332196 ON lrdn2066 CANCELLED AT 2024-12-18T10:05:40 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.

Tracing only fails for Fortran applications that use collectives; C/C++ applications trace correctly.
I originally intended to trace an MPI + CUDA application, but since it already fails with a plain MPI test code, the problem seems to lie with the MPI installation.

I am using an OpenMPI 4.1.6 installation built with NVHPC 24.3, provided as an environment module on a computing cluster.
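
For reference, the test is essentially a ping-pong followed by a collective. A trimmed-down sketch of that pattern (not the exact mpi_ping source; the buffers and data types here are only illustrative) looks like this:

program mpi_ping
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, partner
  integer :: val(1), total(1)
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! ping-pong between even/odd rank pairs
  val(1) = rank
  if (mod(rank, 2) == 0 .and. rank + 1 < nprocs) then
    partner = rank + 1
    call MPI_Send(val, 1, MPI_INTEGER, partner, 0, MPI_COMM_WORLD, ierr)
    call MPI_Recv(val, 1, MPI_INTEGER, partner, 0, MPI_COMM_WORLD, status, ierr)
  else if (mod(rank, 2) == 1) then
    partner = rank - 1
    call MPI_Recv(val, 1, MPI_INTEGER, partner, 0, MPI_COMM_WORLD, status, ierr)
    call MPI_Send(val, 1, MPI_INTEGER, partner, 0, MPI_COMM_WORLD, ierr)
  end if

  ! the collective part, which is where tracing seems to go wrong
  total(1) = 0
  call MPI_Allreduce(val, total, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'allreduce result =', total(1)
  call MPI_Finalize(ierr)
end program mpi_ping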

Is there currently any known issue with tracing MPI applications with UCX enabled?

I don’t know of any, but @rdietrich is the expert on both MPI and UCX tracing, so I’ll let him answer.

Since profiling of C/C++ applications works, I assume the run is on a single node.
Do you run into the same issue using mpirun instead of srun?
Do you run into the same issue with srun -n 8 nsys profile -t mpi -o rep%q{SLURM_PROCID} ./mpi_ping?

If possible, please update Nsight Systems to a newer version. 2024.6.1 contains a bug fix for MPI Fortran profiling (MPI_IN_PLACE was not handled correctly), and that bug might be the cause of the issue you’re seeing.
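
For reference, the pattern affected by that fix is a Fortran collective that passes MPI_IN_PLACE, e.g. a call of this shape (purely illustrative, not taken from your code):

  ! in-place reduction: the result overwrites the local buffer buf(1:n)
  call MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)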

The run is on two nodes with 4 tasks each. Single-node runs with multiple tasks execute and trace correctly with both srun and mpirun.
It seems to be a problem with tracing over the interconnect.

Here’s what I get when I run mpirun across 2 nodes:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

Unfortunately, tracing each MPI rank separately, as suggested, also fails with the following error:

srun: error: Task launch for StepId=10369221.0 failed on node lrdn2298: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 10369221.0 ON lrdn0088 CANCELLED AT 2024-12-19T11:25:07 ***
srun: error: lrdn0088: tasks 0-3: Killed

Also, I am now using Nsight Systems 2024.7.1.