Nsight Systems: measuring CUDA Fortran with MPI

Hi, guys.

I have developed a CUDA Fortran code with MPI, and I am trying to analyze the performance of the kernels and MPI communication using Nsight Systems (ver 2025.1.3).

I enabled the MPI/Open MPI option in Nsight Systems.

I also set: sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

and ran it as: mpirun -np 4 ./main

After this, I can see CPU activities in the results, but I still cannot see the timing breakdown of MPI calls (e.g. MPI_Send, MPI_Recv).

Does anyone know how to correctly configure Nsight Systems so that I can profile and visualize MPI usage time?

Thanks in advance!

@rdietrich, can you help?

Did you add mpi to the tracing flag (--trace|-t mpi)? From the screenshot, it looks like you're using the default tracing options, which do not include MPI tracing. If you did add it, can you please send us the full nsys profile command you're using?
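For reference, a minimal invocation with MPI tracing enabled would look something like this (the report name and launch line are just placeholders):

nsys profile -t cuda,mpi -o report mpirun -np 4 ./main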

Hi, I used the Nsight Systems UI to set it up, and I’m certain that I selected the MPI option.

I also tried profiling via CLI with:

nsys profile --trace=cuda,mpi --force-overwrite=true -o myprofile mpirun --use-hwthread-cpus -np 4 ./main > SCREEN & 

However, the MPI API calls still do not appear in the timeline. Has anyone encountered this issue or knows what might cause it?

Can you try again, explicitly setting the MPI implementation type (--mpi-impl=openmpi)? This shouldn't be an issue, but it might be worth a check.

Hi, I used this command:

nsys profile --trace=cuda,mpi --mpi-impl=openmpi --force-overwrite=true -o myprofile-openmpi mpirun --use-hwthread-cpus -np 4 ./main > SCREEN &

and it still doesn't show the MPI API calls.

I also tried a simple example (mpihello.f90) from the installation directory and added some MPI_IRecv and MPI_ISend statements.
I used the following profiling command:

nsys profile --trace=cuda,mpi --mpi-impl=openmpi --force-overwrite=true \
    -o myprofile mpirun --use-hwthread-cpus -np 4 ./main > SCREEN &

In this test, I can see the MPI_ISend and MPI_IRecv calls in the Nsight Systems timeline.
However, when I profile my own code, I still cannot find any MPI calls there.

Does anyone know what could cause Nsight Systems to miss MPI traces from my code?
@hwilper @rdietrich

Nsight Systems uses LD_PRELOAD to inject the MPI tracing. How is MPI loaded/set up in your execution environment? How does the execution of your “real” app differ from the execution of your test app?
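As a quick sanity check (just a sketch), you could verify which MPI runtime and compiler wrapper are actually picked up in that environment:

which mpirun mpif90
mpirun --version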

One possibility is that MPI tracing silently fails. Let's first check the stdout and stderr output (use the drop-down box “Timeline View” at the top left). If this doesn't help, we should check NVLOG. To do so, save the following content to an nsys_nvlog.config file:

+ 75iewf 75IWEF global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text

Then prepend NVLOG_CONFIG_FILE=<path to nsys_nvlog.config> to the command line. After execution, there should be a log file at /tmp/nsight-sys.log.
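For example, assuming the config file is saved in the working directory, the full command line would look roughly like:

NVLOG_CONFIG_FILE=$PWD/nsys_nvlog.config nsys profile --trace=cuda,mpi -o myprofile mpirun --use-hwthread-cpus -np 4 ./main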

Note that the MPI Fortran 2008 bindings (use mpi_f08) are not supported by Nsight Systems, and those MPI calls won't show up.

I confirmed that CUDA-aware MPI is loaded by exporting the NVHPC MPI path. I ran the same nsys profile command for both a minimal test program and our real application:

nsys profile --trace=cuda,mpi --force-overwrite=true -o myprofile-WENO-JS \
  mpirun --use-hwthread-cpus -np 4 ./main > SCREEN-WENO-JS.log

The test program only calls MPI (e.g. MPI_Send/MPI_Recv) and does not launch any CUDA kernels. The real program runs normally and uses CUDA kernels + MPI. For both runs I checked stdout/stderr and there are no error messages; program outputs are correct.

I tried your suggestion to generate the log file, and it turned out to be quite large (a 100-step run produced a 2.6 GB log).
Could you please tell me how I can analyze it?

I didn't expect the log to be so large. You can search for "MPI " in the log, but I actually wanted to take a look at the log myself, since there are a lot of other messages that can indicate an issue with MPI tracing.
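For example, a quick first pass could be (just a sketch):

grep "MPI " /tmp/nsight-sys.log | less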

Let’s try to get a smaller log by replacing + 75iewf 75IWEF global with
+ 50wefiWEFI global
+ 75wefiWEFI Injection
If this is still too large to share, replace the “global” with “quadd_common”.
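For clarity, the resulting nsys_nvlog.config would then look like this (the sink and format lines stay the same; only the severity/filter lines change):

+ 50wefiWEFI global
+ 75wefiWEFI Injection
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text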

Other things that could help to determine the issue:

Thank you for your detailed suggestions.

  • I followed your advice and used + 50wefiWEFI global and + 75wefiWEFI Injection with quadd_common. The attachment is the output log file; this is the report file.

  • Yes, the MPI I am using is /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/bin/mpif90, which corresponds to OpenMPI 3.1.5 from the NVHPC SDK.

I encountered an issue while compiling this simple code (multi-gpu-programming-models/mpi/). After loading the NVHPC environment, I directly ran make, but it produced the following compilation error:

$ make
mpicxx -DUSE_NVTX -I/usr/local/cuda-12.2/include -std=c++14 jacobi.cpp jacobi_kernels.o -L/usr/local/cuda-12.2/lib64 -lcudart -ldl -o jacobi
jacobi.cpp:
"/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include/algorithm", line 10: catastrophic error: cannot open source file "algorithm"
  #include_next <algorithm>
                           ^
1 catastrophic error detected in the compilation of "jacobi.cpp".
Compilation terminated.
make: *** [Makefile:32: jacobi] Error 2

I’m not very familiar with C++, so I asked ChatGPT for help. It suggested adding the option --gcc-toolchain=/usr when using mpicxx, which allowed the code to compile successfully.
However, when I run the program, I get the following runtime error:

ERROR: CUDA RT call "cudaGetLastError()" in line 71 of file jacobi_kernels.cu failed with no kernel image is available for execution on the device (209).

I have fixed this issue: I added $(GENCODE_SM60) to GENCODE_FLAGS (since the GPUs are P100s) and added --gcc-toolchain=/usr. Now it works.
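For reference, the adjusted build steps looked roughly like this (a sketch; the exact Makefile variables and flags may differ slightly):

# compile the kernels with an sm_60 target for the P100
nvcc -gencode arch=compute_60,code=sm_60 -std=c++14 -c jacobi_kernels.cu -o jacobi_kernels.o

# link with mpicxx, pointing the NVHPC front end at the system GCC installation
mpicxx --gcc-toolchain=/usr -DUSE_NVTX -I/usr/local/cuda-12.2/include -std=c++14 jacobi.cpp jacobi_kernels.o -L/usr/local/cuda-12.2/lib64 -lcudart -ldl -o jacobi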

When I profile this demo code, the MPI calls do show up.

nsight-sys.log (315.3 KB)

How do you compile/link your program? There has to be a difference compared to compiling/linking the toy application, e.g. static linkage. Do you use any other MPI wrappers? Can you provide the output of ldd ./main?
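For example, something like this would show whether the Open MPI libraries are picked up dynamically (the exact paths and versions will differ on your system):

ldd ./main | grep -i mpi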

Please also try to run with:

mpirun --use-hwthread-cpus -np 4 nsys profile --trace=cuda,mpi -f true -o myprofile-WENO-JS.%q{OMPI_COMM_WORLD_RANK} ./main

This launches nsys per MPI rank and writes one report per rank; %q{OMPI_COMM_WORLD_RANK} expands to each rank's value of that environment variable.

(I requested access permissions for the report file.)