pgprof hangs when profiling GPU kernel


I have an application that uses CUDA Fortran and runs without problems. There are 3 kernels I want to profile to get some metrics. The entire application runs in about 30 s with CUDA, but when I profiled a single metric of a single kernel, the run had already taken 15 min when I canceled it. The standard profiling (-o output_profiling) takes about the same time as the application. Is this delay normal?

Compilation flags: -O2 -Mcuda
pgprof command: pgprof --cpu-profiling off --kernels kernel1 --metrics metric1 ./app input

kernel1 is the name of the subroutine inside the kernel module, and metric1 is a metric listed by “pgprof --query-events”.

Do I need to compile the code with other flags?

When I canceled the profiling, a line of “Metric result:” appeared followed by the device name, kernel name, metric name and description, and min/max/ave values, so I think the profiling was being performed.


Hi Henrique,

Is this delay normal?

For a single metric, no, I wouldn’t expect this much of a delay. It shouldn’t really take any longer than doing the standard profile. If you were doing “all” metrics, then it does take quite a bit longer since the kernel needs to be launched multiple times to gather all the hardware statistics. Unfortunately, I don’t know what’s causing the delay with a single metric.

Note that pgprof is really just a re-branded nvprof, with a few different defaults (like CPU profiling on) and different releases. If you have a CUDA SDK installed, you might try using nvprof instead as a check.

nvprof, and eventually pgprof as well, are in the process of being replaced by Nsight Systems and Nsight Compute, where Systems gives you the standard profile and Compute gives you the metrics. So using Nsight Compute may be another option. See this page for more details:


I have a local installation of PGI in my home directory (pgf90 version 19.10-0), and I saw that the pgi directory has 6 different copies of nvprof (llvm/nollvm, each with 10.1/10.0/9.2). Only version 9.2 runs, but it hangs; the others give the error “Error: incompatible CUDA driver version.” This error also appears if I use the nvprof installed on the system, which is 10.0.130. Besides, I don’t have privileges to install any software on the system.

The kernels I want to profile are called millions of times, so maybe the delay occurs because of this number of routine calls. Do you think that is the reason?

The kernels I want to profile are called millions of times, so maybe the delay occurs because of this number of routine calls. Do you think that is the reason?

It could be the cause. Even with one metric there is some overhead to collect the hardware counters.

One thing you can try is to add calls to “cudaProfilerStart” and “cudaProfilerStop” to your code. Start the profile after about 100,000 calls, and stop after about 100,100 (or thereabouts). Assuming the characteristics don’t change much from one kernel launch to the next, measuring 100 kernel launches should be more than enough.

Note that you’ll also need to add the flag “--profile-from-start off” to the pgprof command line to enable the calls.
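
As a rough illustration of the suggestion above, here is a minimal CUDA Fortran sketch that profiles only a window of 100 launches. The kernel body, the array size, and the launch counts (n_launches, first_launch, n_profiled) are placeholders invented for this example, not taken from the original application; only cudaProfilerStart and cudaProfilerStop (from the cudafor module) and the pgprof flags come from this thread.

```fortran
module kernels
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in for one of the kernels being profiled.
  attributes(global) subroutine kernel1(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine kernel1
end module kernels

program profile_window
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n            = 1024    ! placeholder problem size
  integer, parameter :: n_launches   = 200000  ! placeholder total launch count
  integer, parameter :: first_launch = 100000  ! first launch inside the window
  integer, parameter :: n_profiled   = 100     ! number of launches to profile
  real, device, allocatable :: a_d(:)
  integer :: iter, istat

  allocate(a_d(n))
  a_d = 0.0

  do iter = 1, n_launches
    ! Turn collection on just before the first launch of the window.
    if (iter == first_launch) istat = cudaProfilerStart()

    call kernel1<<<(n + 255) / 256, 256>>>(a_d, n)

    ! After the last launch of the window, wait for it to finish,
    ! then turn collection off again.
    if (iter == first_launch + n_profiled - 1) then
      istat = cudaDeviceSynchronize()
      istat = cudaProfilerStop()
    end if
  end do

  istat = cudaDeviceSynchronize()
  deallocate(a_d)
end program profile_window
```

Built with the same flags as before (-O2 -Mcuda), this would be run as “pgprof --cpu-profiling off --profile-from-start off --kernels kernel1 --metrics metric1 ./app input”. Without “--profile-from-start off” the profiler collects from the first launch onward, so the window has no effect on the cost.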


After using those functions I was able to collect the metrics, and the application finished properly. I set the collection to cover 1% of the total number of iterations, and the run took approximately 9 min. As I mentioned before, the application optimized with CUDA takes about 30 s, so profiling an entire run with pgprof would take about 900 min, which is why it seemed to hang. Besides, I noticed that memory metrics are collected much faster than flop metrics. These long runtimes are from runs collecting flop metrics.

I will use the average metric value returned by pgprof for a small number of iterations to compute the average metric for the entire run of the application.
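
The estimate above can be written out as a small calculation. This is only a sketch, assuming that profiling cost scales linearly with the number of profiled iterations and that unprofiled iterations run at normal speed; the three input values are the ones quoted in this post, and the variable names are made up for the example.

```fortran
program estimate_profiling_time
  implicit none
  real, parameter :: base_min    = 0.5   ! unprofiled run time (about 30 s)
  real, parameter :: sampled_min = 9.0   ! run time with 1% of iterations profiled
  real, parameter :: fraction    = 0.01  ! fraction of iterations profiled
  real :: naive_min, refined_min

  ! Naive scaling: treat the whole 9 min as profiling cost.
  naive_min = sampled_min / fraction

  ! Refined scaling: first subtract the time spent in the 99% of
  ! iterations that ran unprofiled, then scale the remainder.
  refined_min = (sampled_min - (1.0 - fraction) * base_min) / fraction

  print '(a,f8.1,a)', 'naive estimate:   ', naive_min,   ' min'
  print '(a,f8.1,a)', 'refined estimate: ', refined_min, ' min'
end program estimate_profiling_time
```

The naive figure reproduces the roughly 900 min quoted above; removing the unprofiled share first gives about 850 min. Either way, a fully profiled run of the flop metrics would be on the order of a thousand times slower than the 30 s unprofiled run.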

I saw that I can also use CUPTI to collect more precise metrics, and I want to avoid using average values because they may not represent the actual execution of each routine. Do you know if PGI includes the CUPTI libraries for use in the code? Is there any documentation?

Thanks again for your help.

Yes, PGI includes CUPTI as part of our packages. It’s the device profiling library that we use for PGPROF/NVPROF as well as our PGI_ACC_TIME environment variable for quick command line profiling.

It can be found under the “$PGI/2019/cuda/[9.2|10.0|10.1]/lib64” directories, depending on the CUDA version of your driver.

For documentation see: