kernel execution time within source code (pgi accelerator)

sWienke · February 4, 2011, 3:37pm

Hello,
I compile my Accelerator program using “-ta=nvidia,time”, which gives a nice overview of the kernel execution time. However, I would like to have this information within my source code to compute GFlops or speedups. Is there a function that I can use (preferably in C) to measure just the kernel execution time, without data transfer? Where do I have to put it? (I cannot use gettimeofday() within the accelerator region, can I?) Or do I have to write a parser for the output I get using -ta=nvidia,time?
Cheers, Sandra

MatColgrove · February 4, 2011, 7:34pm

Hi Sandra,

Unfortunately, we don’t have any runtime functions that you can use to obtain performance information. But it is a good idea, so I’ve sent in a Feature Request (TPR#17668) to see if they could be added.

Thanks,
Mat

mwolfe · February 9, 2011, 9:03pm

We typically do use gettimeofday() calls in C to time the kernels. To time the kernels without the data movement, we use a data region with clauses to move the arrays, then put in gettimeofday() calls around the compute region.
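
A minimal sketch of that approach, assuming the PGI Accelerator directive spelling of that era (`#pragma acc data region`, `#pragma acc region`) and a hypothetical saxpy loop as the kernel. The helper names `wall_seconds`, `timed_saxpy` and `gflops` are made up for illustration. A compiler without accelerator support ignores the pragmas and runs the loop on the host.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, read with gettimeofday(). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/*
 * Hypothetical example kernel: y = a*x + y.
 * Returns the elapsed seconds of the compute region only.
 */
double timed_saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    double t0, t1;

    /* Data region: the arrays are moved to and from the device at its
       boundaries, which lie outside the timed section. */
    #pragma acc data region copyin(x[0:n-1]) copy(y[0:n-1])
    {
        t0 = wall_seconds();

        /* Compute region: only this part is timed. */
        #pragma acc region
        {
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }

        t1 = wall_seconds();
    }
    return t1 - t0;
}

/* saxpy performs 2 floating-point operations per element. */
double gflops(int n, double seconds)
{
    return seconds > 0.0 ? (2.0 * n / seconds) * 1.0e-9 : 0.0;
}
```

For very short kernels the resolution of gettimeofday() may be too coarse, so it can help to run the compute region several times inside the data region and divide by the repeat count.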

Your idea is a good one, and we’d be glad for some feedback.
The times we collect are:

the total, min, max elapsed time spent in each region
the total, min, max elapsed time spent in each kernel

We could probably provide a routine to return the total elapsed time spent in all regions so far, and all kernels so far; your program could then take differences to get the times you want.

Note, enabling the timing does perturb the behavior of the program, so we probably also want a routine to enable and disable the timing behavior dynamically. Would that suffice?

sWienke · February 11, 2011, 8:08am

Hi Michael,
thanks for the good idea to use a data region to measure kernel time without data transfer!

However, a routine to get the runtimes would of course be more convenient! The routines you suggest are entirely sufficient, at least for my purposes. And being able to enable and disable the timing dynamically would be even better.

Any idea which release this will be included in?

tull · April 11, 2011, 11:36pm

Hello,

Michael has submitted some routines to help with timing.

Michael's comments:

In 11.4, we’ve added the following new routines to the accelerator runtime:
acc_enable_time(acc_device_nvidia)
enables timers for accelerator regions and kernels
acc_disable_time(acc_device_nvidia)
disables timers for accelerator regions and kernels
These two routines also affect the -ta=nvidia,time setting; that is
if you compile with -ta=nvidia,time, then call acc_disable_time(),
regions and kernels executed after that point will not be timed.
If you call acc_enable_time(), timing will resume.
Essentially, -ta=nvidia,time acts like an implicit call to
acc_enable_time() at the start of the program. These routines
are implemented with a counter; acc_enable_time increments the counter,
acc_disable_time() decrements it down to zero. As long as the counter
is > 0, timing is enabled.

unsigned long acc_total_time(acc_device_nvidia)
returns the total time spent in accelerator regions

unsigned long acc_exec_time(acc_device_nvidia)
returns the total time executing accelerator kernels

These should be available in the current 11.4 for use.
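
A minimal sketch of how these routines might be used to time a single region by taking differences, as suggested earlier in the thread. It assumes the routines are declared in `accel.h` and that the compiler defines `_ACCEL` when accelerator directives are enabled. The stand-in definitions in the `#else` branch are hypothetical, always report zero, and exist only so the sketch compiles without the PGI runtime. The `acc_timing` struct and `timed_scale` function are made-up names, and the thread does not state the unit of the returned values.

```c
#ifdef _ACCEL
#include <accel.h>   /* assumed header for the PGI accelerator runtime */
#else
/* Hypothetical host-only stand-ins, NOT the real routines. */
enum { acc_device_nvidia = 0 };
static void acc_enable_time(int dev)  { (void)dev; }
static void acc_disable_time(int dev) { (void)dev; }
static unsigned long acc_total_time(int dev) { (void)dev; return 0; }
static unsigned long acc_exec_time(int dev)  { (void)dev; return 0; }
#endif

typedef struct {
    unsigned long region;  /* difference of two acc_total_time() readings */
    unsigned long kernel;  /* difference of two acc_exec_time() readings */
} acc_timing;

/* Hypothetical example: scale a vector in place and report its timings. */
acc_timing timed_scale(int n, float a, float *restrict v)
{
    acc_timing t;
    unsigned long total0, exec0;

    /* Turn the timers on and remember the totals accumulated so far. */
    acc_enable_time(acc_device_nvidia);
    total0 = acc_total_time(acc_device_nvidia);
    exec0  = acc_exec_time(acc_device_nvidia);

    #pragma acc region
    {
        for (int i = 0; i < n; ++i)
            v[i] = a * v[i];
    }

    /* The routines return running totals, so subtract the earlier readings. */
    t.region = acc_total_time(acc_device_nvidia) - total0;
    t.kernel = acc_exec_time(acc_device_nvidia) - exec0;

    /* Balance the earlier enable call (the runtime keeps a counter). */
    acc_disable_time(acc_device_nvidia);
    return t;
}
```

Because enabling and disabling are counted, pairing each acc_enable_time() with one acc_disable_time() leaves a -ta=nvidia,time build timing as before.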
regards,
dave

sWienke · April 13, 2011, 6:32am

Thanks. I will test it as soon as we get the 11.4 release.

FederMor · June 8, 2011, 12:51pm

Maybe I don’t understand the question well, but why not use cudaEvents? You can place them anywhere in the code and measure time with or without data transfer.
For example:

istat = cudaEventCreate(cstart)
istat = cudaEventCreate(cstop)

istat = cudaEventRecord(cstart, 0)
call yourOwnKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size)
istat = cudaEventRecord(cstop, 0)
istat = cudaThreadSynchronize()

istat = cudaEventElapsedTime(ctime, cstart, cstop)
istat = cudaEventDestroy(cstart)
istat = cudaEventDestroy(cstop)

MatColgrove · June 8, 2011, 5:12pm

Hi FederMor,

Xray is using the PGI Accelerator Model, not CUDA Fortran. The Accelerator Model is meant to be generic, so she was requesting a method that is not tied to particular hardware. It was a good idea, so we added the routines.

-Mat

WalS · November 29, 2011, 9:12am

Hello,

I am able to use acc_exec_time() and acc_total_time(). Is there any routine to know the data transfer time as well?

I can get it in the form of “total”, “kernels” and “data” by using pgcollect, but I wish to capture it in the form of variables which can be used further. Can we do that?

Regards
