Problem interpreting -ta=time output


I am having trouble understanding the output of the -ta=time option for pgfortran.

The output is:

    586: region entered 8 times
        time(us): total=383179 init=12 region=383167
                  kernels=28369 data=153875
        w/o init: total=383167 max=165873 min=4553 avg=47895
        595: kernel launched 8 times
            grid: [1-64x1-64]  block: [16x4x4]
            time(us): total=14036 max=12196 min=12 avg=1754
        601: kernel launched 8 times
            grid: [1-64x1-64]  block: [16x4x4]
            time(us): total=14333 max=12426 min=10 avg=1791

I can see that “region”=“total”-“init”, but “kernels”+“data” is nothing like “region”. Where is the rest of the time going, if not on “kernels” or “data”?




I would try some experiments.

  1. If the host you are running on has multiple CPUs, I would use your preferred method to bind the process to one particular core, to prevent it from jumping around.

  2. There might be a granularity issue: the region time is measured on the host, while the kernel and data times come from the device. I would try making the test run longer to see whether the difference shrinks.
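For point 1, on a Linux host one option is util-linux’s “taskset”; a minimal sketch, where “./a.out” is a placeholder for your actual executable (not a name from this thread):

```shell
# Pin the process to CPU core 0 so the scheduler cannot migrate it
# between cores mid-run, which can perturb host-side timings.
taskset -c 0 ./a.out
```

On NUMA systems, “numactl --physcpubind” offers similar control.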


Thanks. What I am trying to do is show how fast the code would run if I could cut out the data transfers (i.e. keep the data resident on the accelerator between subroutine calls). What would be the best number to quote: the summed “kernels” times, the total program time minus the summed “data” times, or something else?

In response to your suggestions, I used “taskset” on the Linux host to bind the process. I also made sure the OpenMP directives in the code were not being used. There was still a discrepancy.

Regarding granularity, another kernel in the same code had these stats:

    1227: region entered 4 times
        time(us): total=363636 init=6 region=363630
                  kernels=13323 data=182556

These times work out to tens of thousands of microseconds per call; timer granularity would have to be extremely poor to account for a gap of that size.
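For what it’s worth, the per-call figures for that region can be checked with simple shell arithmetic (integer division, using the stats quoted above):

```shell
# Average region time per call, as measured on the host:
echo $(( 363630 / 4 ))            # 90907 us
# Average device-side time per call (kernels + data):
echo $(( (13323 + 182556) / 4 ))  # 48969 us
```

That leaves roughly 42,000 us per call unaccounted for, which is far too large to be a timer-resolution artifact.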



Hi Alistair,

I find this odd as well. I’ve tried my best to recreate the mismatched times here, but so far no luck. The profiling information is collected via calls to the CUDA driver, so there could be a problem in how this information is collected, or it could be something specific to your system. If you’re able to share the code, I’d be interested in seeing whether I can recreate the problem here.

Another tack would be to skip “-ta=time” and instead use the CUDA profiler. Full documentation can be found HERE, but the simplest method is to set the following environment variables and then run your program.

    setenv CUDA_PROFILE 1
    setenv CUDA_PROFILE_CSV 1
    setenv CUDA_PROFILE_LOG myProfileFileName.csv.%d

You can change CUDA_PROFILE_LOG to whatever file name you wish. The “%d” refers to the device being used. If multiple devices are used, then multiple profiles will be created.
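For bash users (the setenv lines above are csh syntax), the equivalent would be the following, where “./a.out” is a placeholder for your program:

```shell
# bash equivalents of the csh setenv commands above
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_LOG=myProfileFileName.csv.%d
./a.out   # run the program; the CSV log(s) appear once it exits
```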

Hope this helps,