Nsight Systems not capturing "CUDA memcpy PtoP" messages

I am profiling an MPI+CUDA application that uses RDMA. The run uses 8 MPI ranks, with each rank using a different GPU. Nsight Systems’ gputrace report includes only the HtoD and DtoH types of “CUDA memcpy *” operations, whereas nvprof’s GPU trace includes all three memcpy types: HtoD, DtoH, and PtoP.

To make sure that Nsight Systems is not miscategorizing the PtoP copies as DtoH or HtoD, I compared the total numbers of DtoH and HtoD operations reported by the two profilers and found that they match. So Nsight Systems is not miscategorizing the PtoP copies; it is simply not capturing them at all.
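For reference, the comparison was based on the two tools’ GPU-trace reports, roughly along the lines below (the application name, report names, and the rank environment variable are placeholders, and the exact flags may differ slightly from my actual run):

# nvprof GPU trace: shows HtoD, DtoH, and PtoP rows
jsrun -n8 -r4 -g1 -a1 nvprof --print-gpu-trace ./app

# Nsight Systems: one report per rank, then the gputrace stats report
jsrun -n8 -r4 -g1 -a1 nsys profile -t cuda,mpi -o report.%q{OMPI_COMM_WORLD_RANK} ./app
nsys stats --report gputrace report.0.qdrep   # shows only HtoD and DtoH rows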

I would appreciate any help with this issue.

My apologies for not seeing this earlier.

Are you using CUDA_VISIBLE_DEVICES? We have an open issue we are working on: when CUDA_VISIBLE_DEVICES is used to handle GPU affinity in a multi-GPU MPI program, P2P copies are not visible on the timeline.
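For context, by GPU affinity via CUDA_VISIBLE_DEVICES I mean the usual one-GPU-per-rank setup, for example a wrapper script along these lines (the local-rank variable name is only an illustration and depends on the MPI launcher):

#!/bin/bash
# Sketch: give each MPI rank its own GPU by restricting CUDA_VISIBLE_DEVICES.
# OMPI_COMM_WORLD_LOCAL_RANK is an assumption; your launcher may expose a different variable.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"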

Yes. In my particular run, jsrun sets CUDA_VISIBLE_DEVICES for each MPI rank as follows:

Task 0 ( 0/8, 0/4 ) is bound to cpu[s] 0-3 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={0:4} and CUDA_VISIBLE_DEVICES=0
Task 4 ( 4/8, 0/4 ) is bound to cpu[s] 0-3 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={0:4} and CUDA_VISIBLE_DEVICES=0
Task 5 ( 5/8, 1/4 ) is bound to cpu[s] 28-31 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={28:4} and CUDA_VISIBLE_DEVICES=1
Task 1 ( 1/8, 1/4 ) is bound to cpu[s] 28-31 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={28:4} and CUDA_VISIBLE_DEVICES=1
Task 6 ( 6/8, 2/4 ) is bound to cpu[s] 56-59 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={56:4} and CUDA_VISIBLE_DEVICES=2
Task 2 ( 2/8, 2/4 ) is bound to cpu[s] 56-59 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={56:4} and CUDA_VISIBLE_DEVICES=2
Task 7 ( 7/8, 3/4 ) is bound to cpu[s] 88-91 on host f36n15 with OMP_NUM_THREADS=4 and with OMP_PLACES={88:4} and CUDA_VISIBLE_DEVICES=3
Task 3 ( 3/8, 3/4 ) is bound to cpu[s] 88-91 on host f36n14 with OMP_NUM_THREADS=4 and with OMP_PLACES={88:4} and CUDA_VISIBLE_DEVICES=3

Okay, that looks like an issue we had discovered internally. I’ve upped the priority on that and reached out to a couple of my engineers to figure out what is going on.

If it is not too much trouble, can you try it without CUDA_VISIBLE_DEVICES and see if it works for you? I know this doesn’t solve your problem, but it will help us make sure that what you are seeing is the problem we are working on.

Unfortunately, I don’t think unsetting CUDA_VISIBLE_DEVICES is feasible in my case. I’m on Summit, and jsrun there automatically sets this environment variable according to the resource set, binding, and gpus_per_resource_set values. For more information, please refer to the Summit User Guide — OLCF User Documentation.
My jsrun command looks as follows:
jsrun -n8 -r4 -g1 -a1 ...
This means I am using 2 nodes in total, with 4 resource sets per node, where each resource set consists of one MPI rank and one GPU.
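Spelled out flag by flag (the application name is a placeholder):

# -n 8 : 8 resource sets in total
# -r 4 : 4 resource sets per host  ->  8 / 4 = 2 nodes
# -a 1 : 1 MPI task per resource set
# -g 1 : 1 GPU per resource set
jsrun -n8 -r4 -g1 -a1 ./app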

The engineer says he knows how to fix it, and since I know you are all NDA’d over there, we should be able to get you the fix relatively soon.

(and for anyone hitting this thread later, we should be able to get this fix into the next release, which will be 2021.1.3)

Thanks @hwilper!