profiling individual subroutines

Hi Ben,

In 13.7, you’ll be able to use ‘pgcollect’ to create a mixed Host/Device profile and then view the results in PGPROF. Hopefully this will give you an easier method to extract the information you are looking for.

Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.

This gives you the total time spent in this region, including kernel, data, nested regions, and even CPU time.

However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.

You may be encountering the pinned memory issue I discuss in the thread "Different Performance by 13.xx ver.", or some other CUDA/device overhead that is not measured by PGI_ACC_TIME. For that level of detail, you’d need to use NVVP.
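
In the meantime, one rough way to see whether time is going missing is to wrap the call in a host-side wall-clock timer and compare that number against the region total PGI_ACC_TIME prints. The following is only a minimal sketch in C, not your code: `scale_add` is a hypothetical stand-in for your accelerated subroutine, and `wall_seconds` and `timed_scale_add` are illustrative helper names. It assumes a POSIX system for `clock_gettime`, and the `acc` pragma is simply ignored if the file is built without OpenACC enabled.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the accelerated subroutine:
 * computes y = a*x + y inside a single compute region. */
static void scale_add(int n, float a, const float *restrict x,
                      float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Times one call from the host side. The elapsed time covers
 * everything that happens between the two clock reads, whether
 * or not the runtime's own profile reports it. */
static double timed_scale_add(int n, float a, const float *x, float *y)
{
    double t0 = wall_seconds();
    scale_add(n, a, x, y);
    double t1 = wall_seconds();
    return t1 - t0;
}
```

If the value returned by `timed_scale_add` is noticeably larger than the total PGI_ACC_TIME reports for the region, the difference is a hint that some overhead sits outside what the profile measures. Keep in mind that the first call in a run usually also pays for device initialization, so it is worth timing a second call as well.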

  • Mat