How can I calculate the data transfer time between the CPU and GPU?

Hello everyone,

I am comparing serial execution time with OpenACC execution time. With a small dataset, the serial code runs faster than the OpenACC version, but I am sure one of the reasons is the data transfer overhead, because with a medium dataset I can see very clearly that OpenACC beats the serial version by a wide margin.
My question: is there a way to measure the data transfer time between the CPU and GPU?

Thank you in advance

You’ll want to use a profiler such as Nsight-Systems which will show you the data transfer time. See: https://developer.nvidia.com/nsight-systems for details. Nsight-Systems and Nsight-Compute are shipped as part of the NVIDIA HPC SDK under the “<INSTALL_DIR>/profilers/” directory.

Alternatively, you can set the environment variable “NV_ACC_TIME=1”, which will give a basic command-line profile, including transfer time.

-Mat

I have Nsight Compute on my machine. Is there any good tutorial for beginners on how to measure and view the data transfer time for a .cpp + OpenACC program?

For data transfers, you’ll want to use Nsight-Systems, not Nsight-Compute. Compute is best for low-level kernel performance analysis, while Systems gives a higher-level system view, including data transfers.

Performing a web search for “nvidia nsight-systems tutorial” will yield many results. Besides the Nsight-Systems documentation, this video may be helpful: https://www.youtube.com/watch?v=kKANP0kL_hk

I run this command to get the report

nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o output ./test

Then I open the result in Nsight Systems, but I don’t know where to find the data transfer time. I watched the YouTube video, but it didn’t help much, especially since the report doesn’t seem to have enough information in it.
I uploaded the report here:

Report

Also, in my code I have surrounded one loop with a piece of code to calculate the time. When I run the binary without the profiler, the time is less than when I run it under the profiler.
Is there any explanation for this behavior?

Is this your simple sum code? I took a look at the profile and the only data movement (besides the profiler overhead) is a few bytes, which is most likely the reduction variable being copied back and forth between the host and the device.

You need to zoom in quite a bit since the kernel is so short. Most of the profiled time is just the start-up of the binary and overhead from the profiler. I would suggest writing some code that has more compute and performs some data movement (or using a bigger example, such as our matrix multiply found in “<nv_install_dir>/2020/examples/OpenACC/SDK/src/matrixMul/”); otherwise the profile is mostly just CPU time and the timeline isn’t going to show you much.

For this toy program, the compiler runtime profile may be better:

% nvc++ -acc sum.cpp -Minfo=accel
main:
      6, Generating Tesla code
          9, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
      6, Generating implicit copy(sum) [if not already present]
% setenv NV_ACC_TIME 1
% ./sum.exe
1024
Sum = 524800
Accelerator Kernel Timing data
 sum.cpp
  main  NVIDIA  devicenum=0
    time(us): 55
    6: compute region reached 1 time
        6: kernel launched 1 time
            grid: [8]  block: [128]
             device time(us): total=4 max=4 min=4 avg=4
            elapsed time(us): total=821 max=821 min=821 avg=821
    6: data region reached 2 times
        6: data copyin transfers: 1
             device time(us): total=8 max=8 min=8 avg=8
        13: data copyout transfers: 1
             device time(us): total=43 max=43 min=43 avg=43