Nsight Systems: measuring CUDA Fortran with MPI

Hi, guys.

I have developed a CUDA Fortran code with MPI, and I am trying to analyze the performance of the kernels and MPI communication using Nsight Systems (ver 2025.1.3).

I enabled the MPI/Open MPI option in Nsight Systems.

I also set: sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

and ran it as: mpirun -np 4 ./main

After this, I can see CPU activities in the results, but I still cannot see the timing breakdown of MPI calls (e.g. MPI_Send, MPI_Recv).

Does anyone know how to correctly configure Nsight Systems so that I can profile and visualize MPI usage time?

Thanks in advance!

@rdietrich, can you help?

Did you add mpi to the tracing flag (--trace|-t mpi)? From the screenshot, it looks like you're using the default tracing options, which do not include MPI tracing. If you did add it, can you please send us the full nsys profile command you're using?
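For reference, a minimal invocation with MPI tracing enabled would look something like this (the report name and launch line are just placeholders):

nsys profile -t cuda,mpi -o report mpirun -np 4 ./main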

Hi, I used the Nsight Systems UI to set it up, and I’m certain that I selected the MPI option.

I also tried profiling via CLI with:

nsys profile --trace=cuda,mpi --force-overwrite=true -o myprofile mpirun --use-hwthread-cpus -np 4 ./main > SCREEN & 

However, the MPI API calls still do not appear in the timeline. Has anyone encountered this issue or knows what might cause it?

Can you try again, explicitly setting the MPI implementation type (--mpi-impl=openmpi)? This shouldn't be an issue, but it might be worth a check.

Hi, I used this command:

nsys profile --trace=cuda,mpi --mpi-impl=openmpi --force-overwrite=true -o myprofile-openmpi mpirun --use-hwthread-cpus -np 4 ./main > SCREEN &

and it still doesn't show the MPI API calls.

I also tried a simple example (mpihello.f90) from the installation directory and added some MPI_IRecv and MPI_ISend statements.
I used the following profiling command:

nsys profile --trace=cuda,mpi --mpi-impl=openmpi --force-overwrite=true \
    -o myprofile mpirun --use-hwthread-cpus -np 4 ./main > SCREEN &

In this test, I can see the MPI_ISend and MPI_IRecv calls in the Nsight Systems timeline.
However, when I profile my own code, I still cannot find any MPI calls there.

Does anyone know what could cause Nsight Systems to miss MPI traces from my code?
@hwilper @rdietrich

Nsight Systems uses LD_PRELOAD to inject the MPI tracing. How is MPI loaded/set up in your execution environment? How does the execution of your “real” app differ from the execution of your test app?
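As a quick sanity check (just a sketch), you could verify which MPI runtime and compiler wrapper are actually picked up in that environment:

which mpirun mpif90
mpirun --version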

One possibility is that MPI tracing silently fails. Let's first check the stdout and stderr output (use the drop-down box “Timeline View” at the top left). If this doesn't help, we should check NVLOG. To do so, save the following content to an nsys_nvlog.config file:

+ 75iewf 75IWEF global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text

Then prepend NVLOG_CONFIG_FILE=<path to nsys_nvlog.config> to the command line. After execution, there should be a log file at /tmp/nsight-sys.log.
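For example, assuming the config file is saved in the working directory, the full command line would look roughly like:

NVLOG_CONFIG_FILE=$PWD/nsys_nvlog.config nsys profile --trace=cuda,mpi -o myprofile mpirun --use-hwthread-cpus -np 4 ./main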

Note that the MPI Fortran 2008 bindings (use mpi_f08) are not supported by Nsight Systems, and those MPI calls won't show up.

I confirmed that CUDA-aware MPI is loaded by exporting the NVHPC MPI path. I ran the same nsys profile command for both a minimal test program and our real application:

nsys profile --trace=cuda,mpi --force-overwrite=true -o myprofile-WENO-JS \
  mpirun --use-hwthread-cpus -np 4 ./main > SCREEN-WENO-JS.log

The test program only calls MPI (e.g. MPI_Send/MPI_Recv) and does not launch any CUDA kernels. The real program runs normally and uses CUDA kernels + MPI. For both runs I checked stdout/stderr and there are no error messages; program outputs are correct.

I tried your suggestion to generate the log file, and it turned out to be quite large (a 100-step run produced a 2.6 GB log).
Could you please tell me how I can analyze it?

I didn't expect the log to be so large. You can search for "MPI " in the log, but I actually wanted to take a look at the log myself, since there are a lot of other messages that can indicate an issue with MPI tracing.
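For example, a quick first pass could be (just a sketch):

grep "MPI " /tmp/nsight-sys.log | less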

Let’s try to get a smaller log by replacing + 75iewf 75IWEF global with
+ 50wefiWEFI global
+ 75wefiWEFI Injection
If this is still too large to share, replace the “global” with “quadd_common”.
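For clarity, the resulting nsys_nvlog.config would then look like this (the sink and format lines stay the same; only the severity/filter lines change):

+ 50wefiWEFI global
+ 75wefiWEFI Injection
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text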

Other things that could help to determine the issue:

Thank you for your detailed suggestions.

  • I followed your advice and used + 50wefiWEFI global and + 75wefiWEFI Injection with quadd_common. The attachment is the output log file; this is the report file.

  • Yes, the MPI I am using is /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/bin/mpif90, which corresponds to OpenMPI 3.1.5 from the NVHPC SDK.

I encountered an issue while compiling this simple code (multi-gpu-programming-models/mpi/). After loading the NVHPC environment, I directly ran make, but it produced the following compilation error:

$ make
mpicxx -DUSE_NVTX -I/usr/local/cuda-12.2/include -std=c++14 jacobi.cpp jacobi_kernels.o -L/usr/local/cuda-12.2/lib64 -lcudart -ldl -o jacobi
jacobi.cpp:
"/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include/algorithm", line 10: catastrophic error: cannot open source file "algorithm"
  #include_next <algorithm>
                           ^
1 catastrophic error detected in the compilation of "jacobi.cpp".
Compilation terminated.
make: *** [Makefile:32: jacobi] Error 2

I’m not very familiar with C++, so I asked ChatGPT for help. It suggested adding the option --gcc-toolchain=/usr when using mpicxx, which allowed the code to compile successfully.
However, when I run the program, I get the following runtime error:

ERROR: CUDA RT call "cudaGetLastError()" in line 71 of file jacobi_kernels.cu failed with no kernel image is available for execution on the device (209).

I have fixed this issue: I added $(GENCODE_SM60) to GENCODE_FLAGS (since the GPUs are P100s) and added --gcc-toolchain=/usr. Now it works.
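For reference, the adjusted build steps looked roughly like this (a sketch; the exact Makefile variables and flags may differ slightly):

# compile the kernels with an sm_60 target for the P100
nvcc -gencode arch=compute_60,code=sm_60 -std=c++14 -c jacobi_kernels.cu -o jacobi_kernels.o

# link with mpicxx, pointing the NVHPC front end at the system GCC installation
mpicxx --gcc-toolchain=/usr -DUSE_NVTX -I/usr/local/cuda-12.2/include -std=c++14 jacobi.cpp jacobi_kernels.o -L/usr/local/cuda-12.2/lib64 -lcudart -ldl -o jacobi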

When I profile this demo code, the MPI calls do show up.

nsight-sys.log (315.3 KB)

How do you compile/link your program? There has to be a difference compared to compiling/linking the toy application, e.g. static linkage. Do you use any other MPI wrappers? Can you provide the output of ldd ./main?
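For example, something like this would show whether the Open MPI libraries are picked up dynamically (the exact paths and versions will differ on your system):

ldd ./main | grep -i mpi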

Please also try to run with:

mpirun --use-hwthread-cpus -np 4 nsys profile --trace=cuda,mpi -f true -o myprofile-WENO-JS.%q{OMPI_COMM_WORLD_RANK} ./main

This launches nsys per MPI rank and writes one report per rank; %q{OMPI_COMM_WORLD_RANK} expands to each rank's value of that environment variable.

(I requested access permissions for the report file.)