I’ve measured the overall speedup of the entire code with “time ./slab”, but I want to measure the speedup of an individual subroutine. Since the accelerator directives are only in the subroutine in question, I’d like to know its speedup rather than that of the entire code. I know I can calculate this by assuming that the rest of the code takes the same time to run, but is there a way to measure it directly?
With PGI_ACC_TIME=1 set, I take the bolded number below to be the total time spent in accelerator regions, including kernel execution and data transfers. Since this region makes up essentially the entire subroutine, I expected it to accurately give me the accelerated time of the subroutine.
Accelerator Kernel Timing data
/home/ben/slab_support/slab.f
  ppush  NVIDIA  devicenum=0
    [b]time(us): 10,855,571[/b]
    276: compute region reached 40 times
        276: data copyin reached 520 times
             device time(us): total=3,714,420 max=10,758 min=6 avg=7,143
        277: kernel launched 40 times
            grid:   block:
             device time(us): total=3,868,258 max=114,120 min=91,532 avg=96,706
            elapsed time(us): total=3,869,863 max=114,154 min=91,566 avg=96,746
        363: data copyout reached 320 times
             device time(us): total=3,272,893 max=10,823 min=10,170 avg=10,227
However, this time is inconsistent with the overall speedup I am observing, which leads me to believe the profile above is somehow missing some time.
I’d also like to use PGPROF, but I don’t know how to find the wall-clock times of cpush and ppush in its profile, since I get entries for functions like __select_nocancel. Apparently the 169 seconds attributed to __select_nocancel is spent somewhere inside ppush or cpush, but I can’t tell exactly where.