I am profiling an MPI+CUDA application that uses RDMA. The run uses 8 MPI ranks, and each rank uses a different GPU. Nsight Systems’ gputrace includes only the HtoD and DtoH types of “CUDA memcpy *” operations, whereas nvprof’s gputrace includes all three memcpy types: HtoD, DtoH, and PtoP.
To check that Nsight Systems is not miscategorizing the PtoP operations as DtoH or HtoD, I compared the total number of DtoH and HtoD operations reported by the two profilers. The counts are the same in both, so the PtoP operations are not being miscategorized. Nsight Systems is simply not capturing them at all.
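For anyone who wants to repeat the comparison, here is a minimal sketch of the counting step. It assumes the operation names have already been pulled out of each profiler's trace into a list of strings, and that each memcpy name contains the copy kind right after "CUDA memcpy" (for example "[CUDA memcpy HtoD]"). The function names `count_memcpy_kinds` and `missing_kinds` are mine, not part of either tool, and the name pattern may need adjusting for your profiler version.

```python
import re
from collections import Counter

# Assumed name format: "... CUDA memcpy <kind> ...", e.g. "[CUDA memcpy HtoD]".
# Adjust the pattern if your profiler version spells the kind differently.
MEMCPY_KIND = re.compile(r"CUDA memcpy (\w+)")


def count_memcpy_kinds(names):
    """Count memcpy operations per kind (HtoD, DtoH, PtoP, ...)."""
    counts = Counter()
    for name in names:
        match = MEMCPY_KIND.search(name)
        if match:  # skip kernels and any other non-memcpy rows
            counts[match.group(1)] += 1
    return counts


def missing_kinds(reference, other):
    """Return the kinds that appear in `reference` but not in `other`."""
    return sorted(set(reference) - set(other))
```

Running `count_memcpy_kinds` on the names from each trace and comparing the per-kind totals shows whether any kind was relabeled, and `missing_kinds(nvprof_counts, nsys_counts)` lists the kinds that one trace has and the other lacks.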
I would appreciate any help with this issue.