Kernel execution time within source code (PGI Accelerator)

I compile my Accelerator program using “-ta=nvidia,time”, which gives a nice overview of the kernel execution times. However, I would like to have this information within my source code to compute GFLOPS or speedups. Is there a function that I can use (preferably in C) to measure just the kernel execution time, without the data transfer? Where do I have to put it? (I cannot use gettimeofday() within the accelerator region, can I?) Or do I have to write a parser for the output I get using -ta=nvidia,time?
Cheers, Sandra

Hi Sandra,

Unfortunately, we don’t have any runtime functions that you can use to obtain performance information. But it is a good idea, so I’ve sent in a Feature Request (TPR#17668) to see if they could be added.


We typically do use gettimeofday() calls in C to time the kernels. To time the kernels without the data movement, we use a data region with clauses to move the arrays, then put in gettimeofday() calls around the compute region.

Your idea is a good one, and we’d be glad for some feedback.
The times we collect are:

  • the total, min, and max elapsed time spent in each region
  • the total, min, and max elapsed time spent in each kernel

We could probably provide a routine to return the total elapsed time spent in all regions so far, and in all kernels so far; your program could then take differences to get the times you want.

Note, enabling the timing does perturb the behavior of the program, so we probably also want a routine to enable and disable the timing behavior dynamically. Would that suffice?

Hi Michael,
thanks for the good idea to use a data region to measure kernel time without data transfer!

However, a routine to get the runtimes would of course be more convenient! The suggestions you made are entirely sufficient, at least for my purposes. And if you can enable and disable it dynamically, even better.

Any idea in which release it will be picked up?


Michael has submitted some routines to help with timing.

Michael’s comments:

In 11.4, we’ve added the following new routines to the accelerator runtime:

acc_enable_time(): enables timers for accelerator regions and kernels
acc_disable_time(): disables timers for accelerator regions and kernels
These two routines also affect the -ta=nvidia,time setting; that is,
if you compile with -ta=nvidia,time and then call acc_disable_time(),
regions and kernels executed after that point will not be timed.
If you call acc_enable_time(), timing will resume.
Essentially, -ta=nvidia,time acts like an implicit call to
acc_enable_time() at the start of the program. These routines
are implemented with a counter: acc_enable_time() increments the counter,
and acc_disable_time() decrements it (down to zero). As long as the counter
is > 0, timing is enabled.

unsigned long acc_total_time(acc_device_nvidia)
returns the total time spent in accelerator regions

unsigned long acc_exec_time(acc_device_nvidia)
returns the total time executing accelerator kernels

These should be available in the current 11.4 for use.

Thanks. I will test it as soon as we get the 11.4 release.

Maybe I don’t understand the question well, but why not use cudaEvents? You can place them anywhere in the code and measure time with or without data transfer.
For example:

! CUDA Fortran: declarations for the timing variables
type(cudaEvent) :: cstart, cstop
real :: ctime      ! elapsed time in milliseconds
integer :: istat

istat = cudaEventCreate(cstart)
istat = cudaEventCreate(cstop)

istat = cudaEventRecord(cstart, 0)
call yourOwnKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size)
istat = cudaEventRecord(cstop, 0)
istat = cudaThreadSynchronize()   ! make sure the stop event has completed

istat = cudaEventElapsedTime(ctime, cstart, cstop)
istat = cudaEventDestroy(cstart)
istat = cudaEventDestroy(cstop)

Hi FederMor,

Xray is using the PGI Accelerator Model, not CUDA Fortran. The Accelerator Model is meant to be generic, so she was requesting a method that was not tied to particular hardware. It was a good idea, so we added these routines.



I am able to use acc_exec_time() and acc_total_time(). Is there any routine to get the data transfer time as well?

I can get it, broken down into “total”, “kernels”, and “data”, by using pgcollect, but I would like to capture it in variables that can be used further. Can we do that?