I’m trying to measure GPU time using the following routines from accel.h with PGI 20.1:
acc_enable_time( acc_device_nvidia );
...
long etime = acc_exec_time( acc_device_nvidia );
long ttime = acc_total_time( acc_device_nvidia );
acc_disable_time( acc_device_nvidia );
But etime and ttime always come back with the same value (2 us), no matter the program or the number of kernels executed on the device.
What am I doing wrong?
Example:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <accel.h>

#define N 100000000

int main() {
    double pi = 0.0;
    long i;

    acc_init(acc_device_nvidia);
    acc_enable_time(acc_device_nvidia);

    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);

    long etime = acc_exec_time(acc_device_nvidia);
    long ttime = acc_total_time(acc_device_nvidia);
    acc_disable_time(acc_device_nvidia);

    fprintf(stderr, "Execution time in device kernels: %ld\n", etime);
    fprintf(stderr, "Total time in device regions: %ld\n", ttime);
    return 0;
}
Execution result:
pi = 3.141593
Execution time in device kernels: 2
Total time in device regions: 2
Hmm, those routines are very old and were part of the PGI Accelerator Model, which has been deprecated. We’ve since added better ways to profile kernels, either via the environment variable “PGI_ACC_TIME” or via NVIDIA profilers such as Nsight Compute and Nsight Systems.
Can you try using PGI_ACC_TIME and see if this gives you the information you’re looking for?
% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 100000000

int main() {
    double pi = 0.0;
    long i;
    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
    return 0;
}
% setenv PGI_ACC_TIME 1
% pgcc -ta=tesla test.c -Minfo=accel; a.out
main:
     10, Generating Tesla code
         10, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:pi)
     10, Generating implicit copy(pi) [if not already present]
pi = 3.141593

Accelerator Kernel Timing data
test.c
  main  NVIDIA  devicenum=0
    time(us): 825
    10: compute region reached 1 time
        10: kernel launched 1 time
            grid: [65535]  block: [128]
            device time(us): total=678 max=678 min=678 avg=678
            elapsed time(us): total=763 max=763 min=763 avg=763
        10: reduction kernel launched 1 time
            grid: [1]  block: [256]
            device time(us): total=113 max=113 min=113 avg=113
            elapsed time(us): total=135 max=135 min=135 avg=135
    10: data region reached 2 times
        10: data copyin transfers: 1
            device time(us): total=6 max=6 min=6 avg=6
        13: data copyout transfers: 1
            device time(us): total=28 max=28 min=28 avg=28
Setting the PGI_ACC_TIME variable does work and reports correct values at the end of the execution. However, we are aiming to instrument the code ourselves and would like to measure the time the device spends executing each individual kernel. Is there any other way to measure this at runtime, from within the code, without using external tools?
Using PGI_ACC_TIME or profiling tools such as Nsight Systems would certainly be the quickest and easiest way to get this information. Nsight Compute can give you very deep insight into the performance of individual kernels.
There’s also NVIDIA’s CUPTI library (CUPTI :: CUPTI Documentation). Again, it’s not really meant for end users but for use by profilers (it’s what PGI_ACC_TIME uses), though you can call it from your own code.
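If you do want to instrument from within the application, the sketch below shows one way the CUPTI Activity API can be used to collect per-kernel device times at runtime. Treat it as a rough sketch under some assumptions: timing_start() and timing_stop() are placeholder helper names of my own, and CUpti_ActivityKernel4 is the kernel record layout used by CUPTI in the CUDA 9.x to 11.x releases, so check it against the CUPTI version that ships with your toolkit.

/* Sketch: per-kernel device times via the CUPTI Activity API.
 * timing_start()/timing_stop() are hypothetical helpers, not part of CUPTI. */
#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

/* CUPTI asks us for a buffer to fill with activity records. */
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *buffer = (uint8_t *) malloc(BUF_SIZE);  /* malloc alignment is sufficient */
    *size = BUF_SIZE;
    *maxNumRecords = 0;                      /* 0 = as many records as fit */
}

/* CUPTI hands back a filled buffer; walk the records and print kernel times. */
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL ||
            record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *) record;
            fprintf(stderr, "kernel %s: %llu ns\n", k->name,
                    (unsigned long long)(k->end - k->start));
        }
    }
    free(buffer);
}

/* Call before the OpenACC regions you want to measure. */
void timing_start(void)
{
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
}

/* Call after the regions; flushes any buffered records to bufferCompleted. */
void timing_stop(void)
{
    cuptiActivityFlushAll(0);
}

You’d call timing_start() before the OpenACC regions of interest and timing_stop() afterwards, and compile and link against CUPTI (its headers and library typically live under extras/CUPTI in the CUDA toolkit), adding -lcupti plus the appropriate -I/-L paths.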