Issue with profiling accelerator kernels

Can anyone explain the profiled output shown below?
As you can see, there is a huge gap between “region” time
and the sum of “kernels” time and “data” time.

The PGI 2011 User Guide (Chapter 7, "Profiling Accelerator Kernels", page 105) says that the region execution time ("region" time) is divided into kernel execution time ("kernels" time) and data transfer time between the host and the GPU ("data" time), but the output below shows that something costly happens during the region time.

Can anyone explain why the "region" time can be much bigger than the sum of the "kernels" time and the "data" time?
(The accelerator regions are invoked repeatedly in a nested loop, but the loop does not contain any CPU code other than the loop itself, and no data are transferred between the CPU and the GPU.)

[Added later]

  1. A manually written CUDA program shows kernel execution times similar to the ones shown below, and each accelerator loop uses many private scalar variables, which may mean that the PGI Accelerator compiler generates very inefficient code for handling loop-private data.
  2. pgprof shows that about 49 seconds are spent in the routine ‘sstk’; I don’t know why it takes so long. (The total execution time of the manual CUDA version is less than 8 seconds.)
  3. The CUDA Visual Profiler also shows non-trivial time gaps between kernel invocations, even though no data transfers occur between kernel calls.

    Test environment:
    PGI compiler version: 11.5
    GPU: Tesla C2050 with runtime version 4.0
    CPU: Intel Xeon CPU E5520 @ 2.27GHz
    OS: CentOS release 5.6

Accelerator Kernel Timing data
598: region entered 6000 times
time(us): total=2871953 init=1222 region=2870731
kernels=713274 data=0
w/o init: total=2870731 max=716 min=427 avg=478
599: kernel launched 6000 times
grid: [1512] block: [128]
time(us): total=713274 max=130 min=117 avg=118
440: region entered 6000 times
time(us): total=56968526 init=1170 region=56967356
kernels=7977452 data=0
w/o init: total=56967356 max=16848 min=8962 avg=9494
441: kernel launched 6000 times
grid: [1512] block: [128]
time(us): total=7977452 max=1377 min=1243 avg=1329
410: region entered 2000 times
time(us): total=3495504 init=384 region=3495120
kernels=334715 data=0
w/o init: total=3495120 max=1980 min=1672 avg=1747
411: kernel launched 2000 times
grid: [1512] block: [128]
time(us): total=334715 max=187 min=166 avg=167
402: region entered 2000 times
time(us): total=664865 init=376 region=664489
kernels=244456 data=0
w/o init: total=664489 max=408 min=319 avg=332
403: kernel launched 2000 times
grid: [7560] block: [128]
time(us): total=244456 max=126 min=121 avg=122
385: region entered 1 time
time(us): total=489
kernels=61 data=0
386: kernel launched 1 times
grid: [1512] block: [128]
time(us): total=61 max=61 min=61 avg=61
377: region entered 1 time
time(us): total=64112060 init=94186 region=64017874
w/o init: total=64017874 max=64017874 min=64017874 avg=64017874

The above problem disappeared once all private clauses were removed; it seems that the PGI Accelerator compiler has a bug in handling private clauses. For the tested program, the versions with and without explicit private clauses generate the same, correct output, but the version with explicit private clauses takes much longer than the one without them (63 s vs. 9 s).

Hi Seyong,

Is it possible for you to post a specific example, or send one to PGI Customer Service?