Measuring GPU time using accel.h routines with PGI 20.1

Hi,

I’m trying to measure GPU time using the following routines from accel.h with PGI 20.1:

acc_enable_time( acc_device_nvidia );
...
long etime = acc_exec_time( acc_device_nvidia );
long ttime = acc_total_time( acc_device_nvidia );
acc_disable_time( acc_device_nvidia);

But etime and ttime always come back with the same value (2 us), no matter the program or the number of kernels executed on the device.

What am I doing wrong?

Example:

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <accel.h>

#define N 100000000

int main() {
    double pi = 0.0; long i;
    acc_init(acc_device_nvidia);
    acc_enable_time(acc_device_nvidia);
    #pragma acc parallel loop reduction(+:pi)
    for (i=0; i<N; i++){
        double t = (i+0.05)/N;
        pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
    long etime = acc_exec_time(acc_device_nvidia);
    long ttime = acc_total_time(acc_device_nvidia);

    acc_disable_time(acc_device_nvidia);

    fprintf(stderr, "Execution time in device kernels: %ld\n", etime);
    fprintf(stderr, "Total time in device regions: %ld\n", ttime);
    return 0;
}

execution result:
pi = 3.141593
Execution time in device kernels: 2
Total time in device regions: 2

Thanks in advance.

Hi ALI MUHAMMAD,

Hmm, those routines are very old and were part of the PGI Accelerator Model, which has been deprecated. We’ve also since added better ways to profile kernels, either via the environment variable “PGI_ACC_TIME” or with NVIDIA profilers such as Nsight Compute and Nsight Systems.

Can you try using PGI_ACC_TIME and see if this gives you the information you’re looking for?

% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 100000000

int main() {
    double pi = 0.0; long i;
    #pragma acc parallel loop reduction(+:pi)
    for (i=0; i<N; i++){
        double t = (double)((i+0.05)/N);
        pi += 4.0/(1.0+t*t);
        }
    printf("pi = %f\n", pi/N);
    return 0;

}
% setenv PGI_ACC_TIME 1
% pgcc -ta=tesla test.c -Minfo=accel; a.out
main:
     10, Generating Tesla code
         10, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:pi)
     10, Generating implicit copy(pi) [if not already present]
pi = 3.141593

Accelerator Kernel Timing data
test.c
  main  NVIDIA  devicenum=0
    time(us): 825
    10: compute region reached 1 time
        10: kernel launched 1 time
            grid: [65535]  block: [128]
             device time(us): total=678 max=678 min=678 avg=678
            elapsed time(us): total=763 max=763 min=763 avg=763
        10: reduction kernel launched 1 time
            grid: [1]  block: [256]
             device time(us): total=113 max=113 min=113 avg=113
            elapsed time(us): total=135 max=135 min=135 avg=135
    10: data region reached 2 times
        10: data copyin transfers: 1
             device time(us): total=6 max=6 min=6 avg=6
        13: data copyout transfers: 1
             device time(us): total=28 max=28 min=28 avg=28

-Mat

Thank you for your prompt reply!

Certainly, setting the PGI_ACC_TIME variable works and reports correct values at the end of the execution. However, we are aiming to instrument the code itself, and would like to measure the time the device spends executing each individual kernel. Is there any other way to measure this at runtime, from within the code and without external tools?

Thank you very much!


Using PGI_ACC_TIME or profiling tools, such as Nsight Systems, would certainly be the quickest and easiest way to get this information. Nsight Compute can give you some very deep insight into the performance of individual kernels.

Sans that, OpenACC does provide a profiling interface you can tap into (see section 5: https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf). Though this isn’t designed for end users, but rather as an interface for profiling tools.

There’s also NVIDIA’s CUPTI library (https://docs.nvidia.com/cupti/Cupti/r_overview.html#r_overview). Again, it’s not really meant for end users but for use by profilers (it’s what PGI_ACC_TIME uses).

-Mat