pgprof: contents of variable before entering parallel region

Hello,
I would like to profile the performance of an OpenACC C program. I have to treat the C source code as a black box, so I cannot write a special program for performance testing (but I can add my own function calls).

What do I want to do?
I would like to get metrics on the data movement (host to device, device to host) for specific OpenACC regions, together with their execution time (when executed on the GPU and on the multicore CPU). I also need the contents of some variables just before the OpenACC regions are entered.

What I have done so far/How I thought I can solve this:
For now I have just printed the contents of the variables to stdout and measured the execution time by taking the difference between the times at which the OpenACC region begins and ends. I think that I cannot measure the memory transfer with this setup. So some code might look like:

printf("SomeVar=%d\n", SomeVar);
struct timeval tim;
gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);

#pragma acc data copy(SomeArray[0:ArrayLength])
#pragma acc parallel loop gang vector present(SomeOtherArray[0:ArrayLength])
for(i=0; i<ArrayLength; i++){
  // Do some calculations
}

gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("for loop took %.6lf seconds \n", t2-t1);

I would then just parse the output, and that works fine for me. The problem is that I cannot get the time for the memory transfer this way (or is it somehow possible?).

When I use pgprof (esp. with --print-gpu-trace), I can get the memory transfer times and the execution time for the parallel region, but not the value for the “SomeVar” variable directly before the loop. This value changes on
different runs and I need to make a connection between this value, the memory transfer times and the time the actual computation needs.


How can I achieve this, so that I can assess all the metrics together? I have searched the official guides for a solution but could not find one. I am still new to OpenACC, though, so I might have missed something (although I hope not).

Thanks,
Daniel

Hi Daniel,

This sounds like a good place to use NVTX (https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx). Basically you can tag regions of code which are then grouped in the profile.

For example, you could do something like:

#include <nvToolsExt.h>
...
char str[80];
printf("SomeVar=%d\n", SomeVar); 
sprintf(str,"VAR%d",SomeVar);
nvtxRangePush(str);

#pragma acc data copy(SomeArray[0:ArrayLength]) 
#pragma acc parallel loop gang vector present(SomeOtherArray[0:ArrayLength]) 

for(i=0; i<ArrayLength; i++){ 
  // Do some calculations 
} 
nvtxRangePop();

Note: compile/link with “-I/path/to/cuda/include -L/path/to/cuda/lib64 -lnvToolsExt”
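
If you tag several regions, the labelling can be wrapped in a small helper. The following is only a sketch: push_var_range, pop_var_range and the USE_NVTX macro are hypothetical names introduced here, and the no-op stand-ins exist only so the file also builds on a machine without the CUDA toolkit. nvtxRangePushA is the ASCII variant that nvtxRangePush normally maps to.

```c
#include <stdio.h>

#ifdef USE_NVTX                 /* assumption: defined when building against CUDA */
#include <nvToolsExt.h>
#else
/* Hypothetical no-op stand-ins so the sketch builds without CUDA. */
static int nvtxRangePushA(const char *msg) { (void)msg; return 0; }
static int nvtxRangePop(void) { return 0; }
#endif

/* Hypothetical helper: write "<name>=<value>" into buf and open a range
 * with that label. Returns the label length, or -1 if it did not fit. */
int push_var_range(char *buf, size_t len, const char *name, int value)
{
    int n = snprintf(buf, len, "%s=%d", name, value);
    if (n < 0 || (size_t)n >= len)
        return -1;              /* truncated: do not push a misleading label */
    nvtxRangePushA(buf);
    return n;
}

/* Hypothetical helper: close the most recently opened range. */
void pop_var_range(void)
{
    nvtxRangePop();
}
```

You would call push_var_range(str, sizeof str, "SomeVar", SomeVar) right before the data directive and pop_var_range() right after the loop. Putting the variable name into the label keeps the ranges distinguishable if you later track more than one variable.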


Then your profile will include “Ranges” based on the str value. Something like:

========       Range "VAR1"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  59.325us         1  59.325us  59.325us  59.325us  VAR1
 GPU activities:  100.00%  45.208ms         1  45.208ms  45.208ms  45.208ms  c_kernel_76_gpu
      API calls:  100.00%  11.587us         1  11.587us  11.587us  11.587us  cuLaunchKernel
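
If you also want to separate the transfers from the compute inside one tagged region, one option is to nest a second range around the loop only. This is a sketch under assumptions, not something I have profiled: scale_with_ranges, the "compute" label and the USE_NVTX macro are made up for illustration, the loop body is a placeholder, and it assumes the parallel loop runs synchronously (no async clause), so the kernel finishes before the inner range is popped.

```c
#include <stdio.h>

#ifdef USE_NVTX                 /* assumption: defined when building against CUDA */
#include <nvToolsExt.h>
#else
/* Hypothetical no-op stand-ins so the sketch builds without CUDA. */
static int nvtxRangePushA(const char *msg) { (void)msg; return 0; }
static int nvtxRangePop(void) { return 0; }
#endif

/* Hypothetical workload: doubles every element of a[0:n]. */
void scale_with_ranges(float *a, int n, int some_var)
{
    char label[80];
    int i;

    snprintf(label, sizeof label, "VAR%d", some_var);
    nvtxRangePushA(label);          /* outer range: transfers and kernel */

    #pragma acc data copy(a[0:n])
    {
        nvtxRangePushA("compute");  /* inner range: the loop only */
        #pragma acc parallel loop gang vector present(a[0:n])
        for (i = 0; i < n; i++) {
            a[i] = 2.0f * a[i];     /* placeholder calculation */
        }
        nvtxRangePop();             /* closes "compute" */
    }

    nvtxRangePop();                 /* closes the outer VAR<n> range */
}
```

The copies triggered by the data directive happen at the entry and exit of the braces, so they should fall inside the outer range but outside "compute". Without OpenACC enabled the pragmas are ignored and the loop simply runs on the host.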

Hope this helps,
Mat