Reducing or synchronizing profiler overhead with MPI

I’m using Nsight Systems to profile an MPI job. I have an MPI halo exchange operation every 3 ms or so, but there is a pause of around 0.8 ms from profiler overhead every 48 ms. Since this pause happens at different times on different processes, it desynchronizes them, causing long waits in my MPI_Waitall calls, which breaks scaling at large process counts.

Does anyone have tips or advice on how to either reduce the profiler overhead, or force the overhead to occur at the same time on all processes (e.g. by triggering it manually)?

An MPI job sounds like you are using a job scheduler. What does the resource allocation look like? Nsight Systems uses some background threads, so could you try allocating more CPU cores than the MPI program itself needs? Since you care about scalability, I assume you are prefixing mpirun before nsys profile, right?
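To illustrate the ordering I mean, a per-rank launch would look roughly like this (just a sketch: the rank count, application name, and rank environment variable are placeholders, and the right variable depends on your MPI implementation):

# launcher first, then one nsys instance per rank; %q{...} expands an environment variable into the report name
mpirun -np 8 nsys profile --trace=mpi,nvtx --output=report.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app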

Could you please provide a few more details on the execution? What arguments are you passing to the nsys profile command?

Thanks for the advice. I’m running on a Slurm cluster on 1 node, with 8 tasks per node.

I’m running a Julia program, which I’m launching as:

mpiexec nsys profile --sample=none --trace=nvtx,mpi,osrt --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl

Adding 2 cpus-per-task seems to help a bit: e.g. rank 5 seems to be able to run the profiler on a different thread.
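For reference, the allocation change is roughly this (a sketch of the relevant Slurm lines, assuming an sbatch script):

# 8 MPI ranks per node, as before
#SBATCH --ntasks-per-node=8
# one spare core per rank so the nsys background thread doesn't have to share a core with the rank
#SBATCH --cpus-per-task=2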

(splitting as it won’t let me post more than 1 pic per post)

However, this doesn’t happen on all ranks, e.g. rank 4.

AFAIK, there is no API that flushes the buffers, which would let you control explicitly when the profiling overhead occurs. It might be worth considering adding a function like cuProfilerFlush. For now, we can just try to figure out what causes the overhead and what helps to reduce it.

As already written, it would be good to have one extra thread/CPU core allocated per program process. From the screenshots, there is nothing that hints at what’s happening. For rank 5, we can at least see that the [NSys] process with the Profiling overhead row is executed on a different core than MPI rank 5 itself. You can see this from the color in the middle bar (brown in the MPI Rank 5 row and pink in the [6743] [NSys] row).

You can also try disabling sampling and CPU context switch tracing to see if that reduces the overhead (-s none --cpuctxsw=none). If the OS runtime (osrt) trace isn’t helping you, try removing it from the trace options.
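Based on the command you posted, that would look roughly like this (you already pass --sample=none, so the main changes are adding --cpuctxsw=none and dropping osrt from --trace):

# sampling off, CPU context switch tracing off, OS runtime tracing removed
mpiexec nsys profile --sample=none --cpuctxsw=none --trace=nvtx,mpi --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl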

Hint: You can also load multiple reports into the same timeline via File → Open and then selecting multiple report files. If you are running on a single node only, you can also prefix nsys profile [args] before mpiexec to write everything into a single report file.
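The single-node, single-report variant would look roughly like this (again just a sketch based on your command; the report name is a placeholder, and nsys follows the child processes that mpiexec starts):

# one nsys instance wrapping the whole MPI launch, so all ranks end up in a single report
nsys profile --sample=none --trace=nvtx,mpi --mpi-impl=mpich --output=${job_id}/report_all mpiexec julia --project scriptname.jl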

Thank you for the hints, I’ll try and report back.

I was originally prefixing nsys profile before mpiexec, but for some reason that hid the overhead costs, which is why I switched the order.

To follow up on this: the solution appears to be to (a) request 2 CPUs per task (--cpus-per-task=2) and (b) launch the MPI job with srun --cpu-bind=cores. (I found the hint in the NERSC documentation: Profiling - NERSC Development System Documentation.)
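In case it’s useful to anyone else, with the --cpus-per-task=2 allocation from above, the launch line becomes roughly this (report naming and trace options as in my earlier command):

# srun with core binding instead of plain mpiexec, so the spare core stays free for the nsys background thread
srun --cpu-bind=cores nsys profile --sample=none --trace=nvtx,mpi,osrt --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl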