Nsys profiling MPI jobs

I heard that nvprof/nvvp will be deprecated in the future, so I am testing nsys and ncu. I am trying to profile an MPI+CUDA program (a simple send/recv between two processes, each process targeting one GPU) with nsys. I am running my code on a GPU node (2 Cascade Lake CPUs + 4 V100 GPUs), using the OpenMPI that ships with the NVIDIA HPC SDK.

The code is main.cpp (3.1 KB). (Oddly, this forum doesn’t allow uploading .cu files, so I renamed the file from .cu to .cpp.)
Here is how I compile it:
nvcc -c main.cu -std=c++11 -ccbin mpic++ -arch=sm_70
mpic++ -o main main.o -lcudart -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/10.2/lib64/ -I/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/10.2/include/

Here is how I run it with nsys:
nsys profile --stats=true --trace=cuda,mpi --mpi-impl=openmpi --output=./nsys.clgpu01.np2 --force-overwrite=true mpirun -np 2 ./main 5

However, nsys is reporting the following error:

Processing [==============================================================100%]
Saved report file to “/tmp/nsys-report-85ee-866e-fc88-ae7f.qdrep”
Exporting 137638 events: [================================================99% ]
Failed to export report.
/fast/src/Alt/QuadD/Common/ProtobufComm/Common/ProtobufUtils.cpp(70): Throw in function void QuadDProtobufUtils::ReadMessage(QuadDProtobufUtils::PbCodedIStream&, QuadDProtobufUtils::PbMessageLite&)
Dynamic exception type: boost::wrapexcept&lt;QuadDCommon::ProtobufParseException&gt;
std::exception::what: ProtobufParseException

What do these messages mean? Also, compared with nvprof + nvvp, I don’t see the cuMemcpyP2P entries generated by CUDA-aware MPI using GPUDirect P2P. This is important information for me: it tells whether the communication goes directly over NVLink (P2P) or is staged through the CPU (D2H + H2D). Is there any way to turn this information on in nsys? Thanks. I really like the separate memory view in nvvp/nvprof that breaks memory copies into categories (H2D, D2H, P2P).

I figured this one out myself as well. The error occurs whenever I profile multiple GPUs (with or without MPI) and pass --stats=true to nsys. Once I removed --stats=true from the nsys command, the error went away.
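If you still want the summary tables, one workaround (assuming a recent nsys build that includes the separate `stats` subcommand) is to collect the profile without `--stats=true` and generate the statistics from the saved report afterwards, for example:

```shell
# Collect the profile without --stats=true, so the failing in-line export is skipped
nsys profile --trace=cuda,mpi --mpi-impl=openmpi \
    --output=./nsys.clgpu01.np2 --force-overwrite=true \
    mpirun -np 2 ./main 5

# Produce the summary statistics from the saved report in a separate step
nsys stats ./nsys.clgpu01.np2.qdrep
```

Running the export as a separate step also makes it easier to retry or narrow down which table fails without re-running the MPI job.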