Measuring GPU time using accel.h routines with PGI 20.1

Hi,

I’m trying to measure GPU time using the following routines from accel.h with PGI 20.1:

acc_enable_time( acc_device_nvidia );
...
long etime = acc_exec_time( acc_device_nvidia );
long ttime = acc_total_time( acc_device_nvidia );
acc_disable_time( acc_device_nvidia);

But etime and ttime always come back with the same value (2 us), no matter the program or the number of kernels executed on the device.

What am I doing wrong?

Example:

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <accel.h>

#define N 100000000

int main() {
    double pi = 0.0; long i;
    acc_init(acc_device_nvidia);
    acc_enable_time(acc_device_nvidia);
    #pragma acc parallel loop reduction(+:pi)
    for (i=0; i<N; i++){
        double t = (i+0.05)/N;
        pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
    long etime = acc_exec_time(acc_device_nvidia);
    long ttime = acc_total_time(acc_device_nvidia);

    acc_disable_time(acc_device_nvidia);

    fprintf(stderr, "Execution time in device kernels: %ld\n", etime);
    fprintf(stderr, "Total time in device regions: %ld\n", ttime);
    return 0;
}

execution result:
pi = 3.141593
Execution time in device kernels: 2
Total time in device regions: 2

Thanks in advance.

Hi ALI MUHAMMAD,

Hmm, those routines are very old and were part of the PGI Accelerator Model, which has been deprecated. We’ve also since added better ways to profile kernels, either via the environment variable “PGI_ACC_TIME” or with NVIDIA profilers such as Nsight Compute and Nsight Systems.

Can you try using PGI_ACC_TIME and see if this gives you the information you’re looking for?

% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 100000000

int main() {
    double pi = 0.0; long i;
    #pragma acc parallel loop reduction(+:pi)
    for (i=0; i<N; i++){
        double t = (double)((i+0.05)/N);
        pi += 4.0/(1.0+t*t);
        }
    printf("pi = %f\n", pi/N);
    return 0;

}
% setenv PGI_ACC_TIME 1
% pgcc -ta=tesla test.c -Minfo=accel; a.out
main:
     10, Generating Tesla code
         10, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:pi)
     10, Generating implicit copy(pi) [if not already present]
pi = 3.141593

Accelerator Kernel Timing data
test.c
  main  NVIDIA  devicenum=0
    time(us): 825
    10: compute region reached 1 time
        10: kernel launched 1 time
            grid: [65535]  block: [128]
             device time(us): total=678 max=678 min=678 avg=678
            elapsed time(us): total=763 max=763 min=763 avg=763
        10: reduction kernel launched 1 time
            grid: [1]  block: [256]
             device time(us): total=113 max=113 min=113 avg=113
            elapsed time(us): total=135 max=135 min=135 avg=135
    10: data region reached 2 times
        10: data copyin transfers: 1
             device time(us): total=6 max=6 min=6 avg=6
        13: data copyout transfers: 1
             device time(us): total=28 max=28 min=28 avg=28

-Mat

Thank you for your prompt reply!

Certainly, setting the PGI_ACC_TIME variable works and reports correct values at the end of the execution. However, we are aiming to instrument the code itself, and would like to measure the time the device spends executing each individual kernel. Is there any other way to measure this at runtime, from within the code and without external tools?

Thank you very much!


Using PGI_ACC_TIME or profiling tools, such as Nsight Systems, would certainly be the quickest and easiest way to get this information. Nsight Compute can give you some very deep insight into the performance of individual kernels.

Sans that, OpenACC does provide a profiling interface you can tap into (see section 5: https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf). Though this isn’t designed for end users, but rather as an interface for profiling tools.

There’s also NVIDIA’s CUPTI library (https://docs.nvidia.com/cupti/Cupti/r_overview.html#r_overview). Again, it’s not really meant for end users but for use by profilers (it’s what PGI_ACC_TIME uses).

-Mat