Capture calls to cudaMemcpy

Hi,

I’m trying to capture the calls to cudaMemcpy* for an application running under a CUDA-aware MPI (MVAPICH2).
This MPI uses UVA for DtoD communication.
I have a simple ping-pong test, and with nvvp I can see some cudaMemcpyAsync DtoD operations as I expect (but without the IDs of the source and destination devices).
However, when I capture calls to cudaMemcpy* (with CUPTI, with seo, or by overriding the cudaMemcpy function symbols with dlsym) I don’t see any DtoD transfers.
I capture only some DtoH, AtoH and HtoA memory kinds.
Is UVA not using cudaMemcpy to transfer the data between the GPUs?
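
For reference, here is roughly the dlsym wrapper I am using (a minimal sketch, assuming the wrapper is built as a shared object and loaded with LD_PRELOAD; it only fires if the MPI actually goes through the runtime API symbol rather than the driver API):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Wrapper with the same signature as the runtime call; build as a shared
 * library and preload it so it shadows the real cudaMemcpy. */
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                       enum cudaMemcpyKind kind)
{
    static cudaError_t (*real_memcpy)(void *, const void *, size_t,
                                      enum cudaMemcpyKind) = NULL;
    if (!real_memcpy)
        real_memcpy = (cudaError_t (*)(void *, const void *, size_t,
                                       enum cudaMemcpyKind))
                          dlsym(RTLD_NEXT, "cudaMemcpy");

    /* Log the call, then forward it to the real runtime. */
    fprintf(stderr, "[interpose] cudaMemcpy: %zu bytes, kind=%d\n",
            count, (int)kind);
    return real_memcpy(dst, src, count, kind);
}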

Moreover, for a DtoD transfer I would like to be able to identify the source and destination devices.
I have tried cudaPointerGetAttributes, but I only obtain -1 in the device field of the cudaPointerAttributes struct.
How do I identify the device ID of the targeted device in UVA mode?
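
Here is roughly what I tried (a sketch; I expected the device field to hold the ordinal of the GPU that owns the allocation):

#include <cuda_runtime.h>

/* Sketch: ask the runtime which device owns a (UVA) pointer. */
static int device_of(const void *ptr)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        cudaGetLastError();  /* plain host pointers report an error; clear it */
        return -1;
    }
    return attr.device;      /* expected: ordinal of the owning device */
}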

Thanks!

The following profiler switch will trace kernel launches as well as cudaMemcpy* ops, including DtoD:

[s]nvprof --gpu-print-trace <app>[/s]

nvprof --print-gpu-trace <app>

Output for the CUDA Samples “bandwidthTest” application:

==11740== NVPROF is profiling process 11740, command: C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release\bandwidthTest.exe
==11740== Profiling application: C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release\bandwidthTest.exe
==11740== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
343.36ms  3.0577ms                    -               -         -         -         -  32.000MB  10.220GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
346.42ms  3.0883ms                    -               -         -         -         -  32.000MB  10.119GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
349.51ms  3.1121ms                    -               -         -         -         -  32.000MB  10.042GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
352.62ms  3.0595ms                    -               -         -         -         -  32.000MB  10.214GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
355.68ms  3.0785ms                    -               -         -         -         -  32.000MB  10.151GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
358.76ms  3.1197ms                    -               -         -         -         -  32.000MB  10.017GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
361.88ms  3.0830ms                    -               -         -         -         -  32.000MB  10.136GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
364.97ms  3.1406ms                    -               -         -         -         -  32.000MB  9.9503GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
368.11ms  3.3384ms                    -               -         -         -         -  32.000MB  9.3608GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
371.45ms  3.0926ms                    -               -         -         -         -  32.000MB  10.105GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
411.52ms  3.1089ms                    -               -         -         -         -  32.000MB  10.052GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
414.75ms  3.0683ms                    -               -         -         -         -  32.000MB  10.185GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
417.82ms  3.0960ms                    -               -         -         -         -  32.000MB  10.094GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
420.91ms  3.0690ms                    -               -         -         -         -  32.000MB  10.183GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
423.98ms  3.0768ms                    -               -         -         -         -  32.000MB  10.157GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
427.06ms  3.0936ms                    -               -         -         -         -  32.000MB  10.102GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
430.16ms  3.0231ms                    -               -         -         -         -  32.000MB  10.337GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
433.18ms  3.0696ms                    -               -         -         -         -  32.000MB  10.181GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
436.25ms  3.0572ms                    -               -         -         -         -  32.000MB  10.222GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
439.31ms  3.0963ms                    -               -         -         -         -  32.000MB  10.093GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
442.41ms  3.0375ms                    -               -         -         -         -  32.000MB  10.288GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoH]
475.24ms  5.7067ms                    -               -         -         -         -  32.000MB  5.4760GB/s  GeForce GTX 980         1         7  [CUDA memcpy HtoD]
480.95ms  342.66us                    -               -         -         -         -  32.000MB  91.199GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
481.29ms  340.48us                    -               -         -         -         -  32.000MB  91.782GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
481.64ms  340.19us                    -               -         -         -         -  32.000MB  91.859GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
481.98ms  341.25us                    -               -         -         -         -  32.000MB  91.575GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
482.32ms  340.00us                    -               -         -         -         -  32.000MB  91.911GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
482.67ms  340.42us                    -               -         -         -         -  32.000MB  91.799GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
483.02ms  340.00us                    -               -         -         -         -  32.000MB  91.911GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
483.36ms  340.71us                    -               -         -         -         -  32.000MB  91.722GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
483.70ms  340.03us                    -               -         -         -         -  32.000MB  91.903GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]
484.04ms  340.67us                    -               -         -         -         -  32.000MB  91.730GB/s  GeForce GTX 980         1         7  [CUDA memcpy DtoD]

Scroll to the bottom right to see the DtoD ops.

Unfortunately, it doesn’t reveal the source and destination devices.

You might be better off using Nsight, since it provides even more information and lets you optionally annotate your application.
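
For example, a minimal NVTX annotation (a sketch, assuming you link against the nvToolsExt library shipped with the toolkit; the function and range names here are just placeholders):

#include <nvToolsExt.h>

void exchange(void)
{
    nvtxRangePushA("MPI ping-pong");  /* named range shows up on the Nsight timeline */
    /* ... CUDA-aware MPI send/recv on device buffers ... */
    nvtxRangePop();
}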

Here’s “bandwidthTest” in Nsight:

I think it is --print-gpu-trace.