Nsight Systems not capturing "CUDA memcpy PtoP" messages

I am profiling an MPI+CUDA application that uses RDMA. The run uses 8 MPI ranks, with each rank using a different GPU. Nsight Systems’ gputrace report includes only the HtoD and DtoH types of “CUDA memcpy *” operations, whereas nvprof’s GPU trace includes all three memcpy types: HtoD, DtoH, and PtoP.

To make sure that Nsight Systems is not miscategorizing the PtoP copies as DtoH or HtoD, I compared the total numbers of DtoH and HtoD operations reported by the two profilers and found that they match. So Nsight Systems is not miscategorizing the PtoP copies; it is simply not capturing them at all.
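For reference, the comparison was based on the two tools’ GPU-trace reports, roughly along the lines below (the application name, report names, and the rank environment variable are placeholders, and the exact flags may differ slightly from my actual run):

# nvprof GPU trace: shows HtoD, DtoH, and PtoP rows
jsrun -n8 -r4 -g1 -a1 nvprof --print-gpu-trace ./app

# Nsight Systems: one report per rank, then the gputrace stats report
jsrun -n8 -r4 -g1 -a1 nsys profile -t cuda,mpi -o report.%q{OMPI_COMM_WORLD_RANK} ./app
nsys stats --report gputrace report.0.qdrep   # shows only HtoD and DtoH rows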

I would appreciate any help with this issue.

My apologies for not seeing this earlier.

Are you using CUDA_VISIBLE_DEVICES? We have an open issue we are working on: when CUDA_VISIBLE_DEVICES is used to handle GPU affinity in a multi-GPU MPI program, P2P copies are not visible on the timeline.
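For context, by GPU affinity via CUDA_VISIBLE_DEVICES I mean the usual one-GPU-per-rank setup, for example a wrapper script along these lines (the local-rank variable name is only an illustration and depends on the MPI launcher):

#!/bin/bash
# Sketch: give each MPI rank its own GPU by restricting CUDA_VISIBLE_DEVICES.
# OMPI_COMM_WORLD_LOCAL_RANK is an assumption; your launcher may expose a different variable.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"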

Yes. In my particular run, jsrun sets CUDA_VISIBLE_DEVICES for each MPI rank as follows:

Task 0 ( 0/8, 0/4 ) is bound to cpu[s] 0-3 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={0:4} and CUDA_VISIBLE_DEVICES=0
Task 4 ( 4/8, 0/4 ) is bound to cpu[s] 0-3 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={0:4} and CUDA_VISIBLE_DEVICES=0
Task 5 ( 5/8, 1/4 ) is bound to cpu[s] 28-31 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={28:4} and CUDA_VISIBLE_DEVICES=1
Task 1 ( 1/8, 1/4 ) is bound to cpu[s] 28-31 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={28:4} and CUDA_VISIBLE_DEVICES=1
Task 6 ( 6/8, 2/4 ) is bound to cpu[s] 56-59 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={56:4} and CUDA_VISIBLE_DEVICES=2
Task 2 ( 2/8, 2/4 ) is bound to cpu[s] 56-59 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={56:4} and CUDA_VISIBLE_DEVICES=2
Task 7 ( 7/8, 3/4 ) is bound to cpu[s] 88-91 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={88:4} and CUDA_VISIBLE_DEVICES=3
Task 3 ( 3/8, 3/4 ) is bound to cpu[s] 88-91 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={88:4} and CUDA_VISIBLE_DEVICES=3

Okay, that looks like an issue we had discovered internally. I’ve upped the priority on that and reached out to a couple of my engineers to figure out what is going on.

If it is not too much trouble, can you try it without CUDA_VISIBLE_DEVICES and see if it works for you? I know this doesn’t solve your problem, but it will help us make sure that what you are seeing is the problem we are working on.

Unfortunately, I don’t think unsetting CUDA_VISIBLE_DEVICES is feasible in my case. I’m on Summit, and jsrun there automatically sets this environment variable according to the resource set, binding, and gpus_per_resource_set values. For more information, please refer to the Summit User Guide — OLCF User Documentation.
My jsrun command looks as follows:
jsrun -n8 -r4 -g1 -a1 ...
This means I am using 2 nodes in total, with 4 resource sets per node, where each resource set consists of one MPI rank and one GPU.
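Spelled out flag by flag (the application name is a placeholder):

# -n 8 : 8 resource sets in total
# -r 4 : 4 resource sets per host  ->  8 / 4 = 2 nodes
# -a 1 : 1 MPI task per resource set
# -g 1 : 1 GPU per resource set
jsrun -n8 -r4 -g1 -a1 ./app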

The engineer says he knows how to fix it, and since I know you are all NDA’d over there, we should be able to get you the fix relatively soon.

(and for anyone hitting this thread later, we should be able to get this fix into the next release, which will be 2021.1.3)

Thanks @hwilper!