Reducing or synchronizing profiler overhead with MPI

I’m using Nsight Systems to profile an MPI job. I have an MPI halo exchange operation every 3 ms or so, but there is a pause of around 0.8 ms from profiler overhead every 48 ms. Since this pause happens at different times on different processes, it desynchronizes them, causing long waits in my MPI_Waitall calls, which breaks scaling at large process counts.

Does anyone have tips or advice on how to either reduce the profiler overhead, or force the overhead to occur at the same time on all processes (e.g. by triggering it manually)?

An MPI job sounds like you are using a job scheduler. What does the resource allocation look like? Nsight Systems uses some background threads, so could you try allocating more CPU cores than the MPI program itself needs? Since you care about scalability, I assume you are prefixing mpirun before nsys profile, right?
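To illustrate the ordering I mean, a per-rank launch would look roughly like this (just a sketch: the rank count, application name, and rank environment variable are placeholders, and the right variable depends on your MPI implementation):

# launcher first, then one nsys instance per rank; %q{...} expands an environment variable into the report name
mpirun -np 8 nsys profile --trace=mpi,nvtx --output=report.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app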

Could you please provide a few more details on the execution? What arguments are you passing to the nsys profile command?

Thanks for the advice. I’m running on a Slurm cluster on 1 node, with 8 tasks per node.

I’m running a Julia program, which I’m launching as:

mpiexec nsys profile --sample=none --trace=nvtx,mpi,osrt --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl

Adding 2 cpus-per-task seems to help a bit: e.g. rank 5 seems to be able to run the profiler on a different thread.
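For reference, the allocation change is roughly this (a sketch of the relevant Slurm lines, assuming an sbatch script):

# 8 MPI ranks per node, as before
#SBATCH --ntasks-per-node=8
# one spare core per rank so the nsys background thread doesn't have to share a core with the rank
#SBATCH --cpus-per-task=2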

(splitting as it won’t let me post more than 1 pic per post)

However, this doesn’t happen on all ranks, e.g. rank 4.

AFAIK, there is no API that flushes the buffers, which would let you control explicitly when the profiling overhead occurs. It might be worth considering adding a function like cuProfilerFlush. For now, we can just try to figure out what causes the overhead and what helps to reduce it.

As already written, it would be good to have one extra thread/CPU core allocated per program process. From the screenshots, there is nothing that hints at what’s happening. For rank 5, we can at least see that the [NSys] process with the Profiling overhead row is executed on a different core than MPI rank 5 itself. You can see this from the color in the middle bar (brown in the MPI Rank 5 row and pink in the [6743] [NSys] row).

You can also try disabling sampling and CPU context switch tracing to see if that reduces the overhead (-s none --cpuctxsw=none). If the OS runtime (osrt) trace isn’t helping you, try removing it from the trace options.
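Based on the command you posted, that would look roughly like this (you already pass --sample=none, so the main changes are adding --cpuctxsw=none and dropping osrt from --trace):

# sampling off, CPU context switch tracing off, OS runtime tracing removed
mpiexec nsys profile --sample=none --cpuctxsw=none --trace=nvtx,mpi --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl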

Hint: You can also load multiple reports into the same timeline via File → Open and then selecting multiple report files. If you are running on a single node only, you can also prefix nsys profile [args] before mpiexec to write everything into a single report file.
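The single-node, single-report variant would look roughly like this (again just a sketch based on your command; the report name is a placeholder, and nsys follows the child processes that mpiexec starts):

# one nsys instance wrapping the whole MPI launch, so all ranks end up in a single report
nsys profile --sample=none --trace=nvtx,mpi --mpi-impl=mpich --output=${job_id}/report_all mpiexec julia --project scriptname.jl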

Thank you for the hints, I’ll try and report back.

I was originally prefixing nsys profile before mpiexec, but for some reason that hid the overhead costs, which is why I switched the order.

To follow up on this: the solution appears to be to (a) request 2 CPUs per task (--cpus-per-task=2) and (b) launch the MPI job with srun --cpu-bind=cores. (I found the hint in the NERSC documentation: Profiling - NERSC Development System Documentation.)
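In case it’s useful to anyone else, with the --cpus-per-task=2 allocation from above, the launch line becomes roughly this (report naming and trace options as in my earlier command):

# srun with core binding instead of plain mpiexec, so the spare core stays free for the nsys background thread
srun --cpu-bind=cores nsys profile --sample=none --trace=nvtx,mpi,osrt --mpi-impl=mpich --output=${job_id}/report.%q{NPROCS}.%q{PMI_RANK} julia --project scriptname.jl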