profiling individual subroutines

I’ve measured the overall speedup of the entire code with “time ./slab”, but I want to measure the speedup of an individual subroutine. Since the accelerator directives are only in the subroutine in question, I’d like to know its speedup rather than that of the entire code. I know I can estimate this by assuming the rest of the code takes the same time to run either way, but is there a way to measure it directly?
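
I could of course wrap the call in my own timer with system_clock, roughly like the sketch below (the program and the empty routine body are just placeholders for the real call site), but I was hoping one of the profiling tools could report this for me.

[code]
program time_ppush
  implicit none
  integer :: t0, t1, rate
  call system_clock(count_rate=rate)
  call system_clock(t0)
  call ppush()                    ! the accelerated subroutine
  call system_clock(t1)
  print *, 'ppush wall-clock time (s):', real(t1 - t0) / real(rate)
contains
  subroutine ppush()              ! placeholder body, just for illustration
  end subroutine ppush
end program time_ppush
[/code]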

Using PGI_ACC_TIME=1, I believe the bolded number below shows the total time spent in accelerator regions, including kernels and data transfers. Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.

Accelerator Kernel Timing data
/home/ben/slab_support/slab.f
  ppush  NVIDIA  devicenum=0
    [b]time(us): 10,855,571[/b]
    276: compute region reached 40 times
        276: data copyin reached 520 times
             device time(us): total=3,714,420 max=10,758 min=6 avg=7,143
        277: kernel launched 40 times
            grid: [65535]  block: [128]
             device time(us): total=3,868,258 max=114,120 min=91,532 avg=96,706
            elapsed time(us): total=3,869,863 max=114,154 min=91,566 avg=96,746
        363: data copyout reached 320 times
             device time(us): total=3,272,893 max=10,823 min=10,170 avg=10,227

However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.

I’d like to use pgprof, but I don’t know how to find the wall-clock time of cpush and ppush from the profile, since all I see are functions like __select_nocancel. Apparently the 169 seconds attributed to __select_nocancel happen somewhere inside ppush or cpush, but I can’t tell exactly where.

Ben

Hi Ben,

In 13.7, you’ll be able to use ‘pgcollect’ to create a mixed Host/Device profile and then view the results in PGPROF. Hopefully this will give you an easier method to extract the information you are looking for.

[quote]Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.[/quote]

This gives you the total time spent in this region, including kernel, data, nested regions, and even CPU time.
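In your output above, for instance, the three device-time totals account for the reported figure exactly: 3,714,420 + 3,868,258 + 3,272,893 = 10,855,571 µs.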

[quote]However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.[/quote]

You may be encountering the pinned memory issue I discuss here: Different Performance by 13.xx ver., or some other CUDA/device overhead that is not measured by PGI_ACC_TIME. For that level of detail, you’d need to use NVVP.

- Mat