The output of PGI_ACC_TIME on V100 GPU

Hi,
I am trying to use PGI_ACC_TIME to profile an OpenACC code on a V100. The kernel includes two matrix-matrix multiplications, which are written as

!$ACC DATA PRESENT(w,u,gxyz,ur,us,ut,wk,dxm1,dxtm1)
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR PRIVATE(wr,ws,wt)
!DIR NOBLOCKING
      do e = 1,nelt
         do k=1,nz1
         do j=1,ny1
         do i=1,nx1
            wr = 0
            ws = 0
            wt = 0
!$ACC LOOP SEQ
            do l=1,nx1  
...
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR 
      do e=1,nelt
         do k=1,nz1
         do j=1,ny1
         do i=1,nx1
            w(i,j,k,e) = 0.0
!$ACC LOOP SEQ
...

On V100 GPU, the output is

...
  ax_acc  NVIDIA  devicenum=0
    time(us): 66,126
    421: data region reached 400 times
    457: compute region reached 200 times
        457: kernel launched 200 times
            grid: [8192]  block: [32x4]
            device time(us): total=36,985 max=190 min=177 avg=184
            elapsed time(us): total=43,155 max=222 min=208 avg=215
    487: compute region reached 200 times
        487: kernel launched 200 times
            grid: [8192]  block: [32x4]
            device time(us): total=29,141 max=148 min=138 avg=145
            elapsed time(us): total=35,290 max=186 min=168 avg=176

On P100, the output is

 ax_acc  NVIDIA  devicenum=0
    time(us): 0
    421: data region reached 400 times
    457: compute region reached 200 times
        457: kernel launched 200 times
            grid: [8192]  block: [32x4]
            elapsed time(us): total=96,545 max=515 min=469 avg=482
    487: compute region reached 200 times
        487: kernel launched 200 times
            grid: [8192]  block: [32x4]
            elapsed time(us): total=79,879 max=427 min=385 avg=399

All variables are “present” on the device, so why are device times reported on the V100 but not on the P100 GPU? How can we avoid it?

device time(us): total=36,985 max=190 min=177 avg=184
device time(us): total=29,141 max=148 min=138 avg=145

Thanks. /JG

Hi jigo3635,

The P100 run did execute on the device, but for some reason the runtime couldn’t find the device profiler shared object, libcupti.so. The elapsed time is measured from the host, while the device time is measured with libcupti (when it is found). Elapsed time is inclusive of the device time plus some overhead.
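
To make the relationship between the two timers concrete, here is a small sketch that derives the per-launch gap from the V100 totals posted above. The function name overhead_per_launch is only illustrative, and the sketch assumes the whole difference between elapsed and device time is host-side overhead.

```shell
#!/bin/sh
# Per-launch overhead = (elapsed total - device total) / number of launches.
# All times are in microseconds, as reported by PGI_ACC_TIME.
# overhead_per_launch is a hypothetical helper, not part of any PGI tool.
overhead_per_launch() {
    # $1 = elapsed total, $2 = device total, $3 = launch count
    # awk does the division because sh arithmetic is integer-only.
    awk -v e="$1" -v d="$2" -v n="$3" \
        'BEGIN { printf "%.3f\n", (e - d) / n }'
}

# V100 totals from the profile above, 200 launches each.
overhead_per_launch 43155 36985 200   # kernel at line 457
overhead_per_launch 35290 29141 200   # kernel at line 487
```

Both kernels come out at roughly 31 us per launch, which fits the description of elapsed time as device time plus a fairly constant overhead.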

libcupti can be found under the “CUDA” directories we ship with the compilers. The exact directory will change depending on which compiler version you’re using, but look under “$PGI/201[8|9]/cuda//lib64”. Set your LD_LIBRARY_PATH to the CUDA lib directory that matches the version you used to compile.

If this is a system where you don’t have PGI installed, you can download libcupti from: https://developer.nvidia.com/CUPTI
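
The LD_LIBRARY_PATH setup described above could be scripted roughly as follows. This is a sketch only: find_cupti_dir and prepend_ld_path are hypothetical helper names, and the directory passed in is assumed to be the CUDA directory that matches the version used at compile time (if several CUDA versions sit under the same root, the first hit in sorted order may not be the one you want).

```shell
#!/bin/sh
# Print the directory holding the first libcupti.so* found under a root.
# Returns non-zero when nothing is found.
find_cupti_dir() {
    root=$1
    hit=$(find "$root" -name 'libcupti.so*' 2>/dev/null | sort | head -n 1)
    if [ -n "$hit" ]; then
        dirname "$hit"
    else
        return 1
    fi
}

# Prepend a directory to LD_LIBRARY_PATH without leaving a stray ':'
# when the variable starts out empty or unset.
prepend_ld_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Example (CUDA_DIR is an assumed variable pointing at the matching
# CUDA directory inside the compiler installation):
# dir=$(find_cupti_dir "$CUDA_DIR") && prepend_ld_path "$dir"
```

Prepending rather than appending matters here, because a different libcupti earlier in the path would otherwise be picked up first.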


Hope this helps,
Mat

Hi Mat,
Thanks for your response.

“Set your LD_LIBRARY_PATH to the CUDA lib directory that matches the version you used to compile.”
The CUDA lib directory seems to be set already, but I still get the device time on V100:

$ echo $LD_LIBRARY_PATH
/apps/PGI/2018-1810/linuxpower/2018/mpi/openmpi-2.1.2/lib:/apps/PGI/2018-1810/linuxpower/18.10/lib:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/nvvm/lib64:/usr/local/cuda-10.0/lib64:/usr/lib64/nvidia/xorg
$ ls /usr/local/cuda-10.0/extras/CUPTI/lib64/
libcupti.so  libcupti.so.10.0  libcupti.so.10.0.130

Any idea about this? Thanks. /JG

I’m not positive. Maybe it’s a mismatch in the CUDA versions that were used to build the binary and the CUDA 10.0 libcupti?

-Mat
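
One way to look into the possible version mismatch is to list every libcupti reachable through LD_LIBRARY_PATH, in search order, and compare the version suffix with the CUDA version the binary was built against. The sketch below is an assumption about how to check this, not a PGI-provided tool, and list_cupti_in_path is a hypothetical name.

```shell
#!/bin/sh
# Print every libcupti.so* found in a colon-separated search path,
# in the order the directories appear. For a given library name the
# directories in LD_LIBRARY_PATH are searched in that same order.
list_cupti_in_path() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r dir; do
        # Skip empty entries such as those produced by a leading ':'.
        [ -n "$dir" ] || continue
        for lib in "$dir"/libcupti.so*; do
            # The glob stays literal when nothing matches, so test it.
            if [ -e "$lib" ]; then
                echo "$lib"
            fi
        done
    done
}

# Example:
# list_cupti_in_path "$LD_LIBRARY_PATH"
```

If the first entry listed carries a different CUDA version than the one used to compile, that would be consistent with the mismatch suggested above.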