profiling individual subroutines

I’ve measured the overall speedup of the entire code with “time ./slab”, but I want to measure the speedup of an individual subroutine. Since the accelerator directives are only in the subroutine in question, I’d like to know its speedup rather than that of the entire code. I know I can estimate this by assuming the rest of the code takes the same time to run either way, but is there a way to measure it directly?
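
I could of course wrap the call in my own timer with system_clock, roughly like the sketch below (the program and the empty routine body are just placeholders for the real call site), but I was hoping one of the profiling tools could report this for me.

[code]
program time_ppush
  implicit none
  integer :: t0, t1, rate
  call system_clock(count_rate=rate)
  call system_clock(t0)
  call ppush()                    ! the accelerated subroutine
  call system_clock(t1)
  print *, 'ppush wall-clock time (s):', real(t1 - t0) / real(rate)
contains
  subroutine ppush()              ! placeholder body, just for illustration
  end subroutine ppush
end program time_ppush
[/code]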

Using PGI_ACC_TIME=1, I believe the bolded number below shows the total time spent in accelerator regions, including kernels and data transfers. Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.

Accelerator Kernel Timing data
/home/ben/slab_support/slab.f
  ppush  NVIDIA  devicenum=0
    [b]time(us): 10,855,571[/b]
    276: compute region reached 40 times
        276: data copyin reached 520 times
             device time(us): total=3,714,420 max=10,758 min=6 avg=7,143
        277: kernel launched 40 times
            grid: [65535]  block: [128]
             device time(us): total=3,868,258 max=114,120 min=91,532 avg=96,706
            elapsed time(us): total=3,869,863 max=114,154 min=91,566 avg=96,746
        363: data copyout reached 320 times
             device time(us): total=3,272,893 max=10,823 min=10,170 avg=10,227

However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.

I’d like to use pgprof, but I don’t know how to find the wall-clock time of cpush and ppush from the profile, since all I see are functions like __select_nocancel. Apparently the 169 seconds attributed to __select_nocancel happen somewhere inside ppush or cpush, but I can’t tell exactly where.

Ben

Hi Ben,

In 13.7, you’ll be able to use ‘pgcollect’ to create a mixed Host/Device profile and then view the results in PGPROF. Hopefully this will give you an easier method to extract the information you are looking for.

[quote]Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.[/quote]

This gives you the total time spent in this region, including kernel, data, nested regions, and even CPU time.
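In your output above, for instance, the three device-time totals account for the reported figure exactly: 3,714,420 + 3,868,258 + 3,272,893 = 10,855,571 µs.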

[quote]However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.[/quote]

You may be encountering the pinned memory issue I discuss here: Different Performance by 13.xx ver., or some other CUDA/device overhead that is not measured by PGI_ACC_TIME. For that level of detail, you’d need to use NVVP.

- Mat