The output of PGI_ACC_TIME on V100 GPU

gongjing · March 19, 2019, 11:07am

Hi,
I am trying to use PGI_ACC_TIME to profile an OpenACC code on a V100. The kernel includes two matrix-matrix multiplications, written as:

!$ACC DATA PRESENT(w,u,gxyz,ur,us,ut,wk,dxm1,dxtm1)
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR PRIVATE(wr,ws,wt)
!DIR NOBLOCKING
      do e = 1,nelt
         do k=1,nz1
         do j=1,ny1
         do i=1,nx1
            wr = 0
            ws = 0
            wt = 0
!$ACC LOOP SEQ
            do l=1,nx1  
...
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR 
      do e=1,nelt
         do k=1,nz1
         do j=1,ny1
         do i=1,nx1
            w(i,j,k,e) = 0.0
!$ACC LOOP SEQ
...

On the V100 GPU, the output is:

...
  ax_acc  NVIDIA  devicenum=0
    time(us): 66,126
    421: data region reached 400 times
    457: compute region reached 200 times
        457: kernel launched 200 times
            grid: [8192]  block: [32x4]
            device time(us): total=36,985 max=190 min=177 avg=184
            elapsed time(us): total=43,155 max=222 min=208 avg=215
    487: compute region reached 200 times
        487: kernel launched 200 times
            grid: [8192]  block: [32x4]
            device time(us): total=29,141 max=148 min=138 avg=145
            elapsed time(us): total=35,290 max=186 min=168 avg=176

On the P100, the output is:

 ax_acc  NVIDIA  devicenum=0
    time(us): 0
    421: data region reached 400 times
    457: compute region reached 200 times
        457: kernel launched 200 times
            grid: [8192]  block: [32x4]
            elapsed time(us): total=96,545 max=515 min=469 avg=482
    487: compute region reached 200 times
        487: kernel launched 200 times
            grid: [8192]  block: [32x4]
            elapsed time(us): total=79,879 max=427 min=385 avg=399

All variables are “present” on the device, so why are device times reported on the V100 but not on the P100? How can we avoid this?

device time(us): total=36,985 max=190 min=177 avg=184
device time(us): total=29,141 max=148 min=138 avg=145

Thanks. /JG

MatColgrove · March 19, 2019, 3:10pm

Hi jigo3635,

The P100 run did execute on the device, but for some reason the runtime couldn’t find the device profiler shared object, libcupti.so. The elapsed time is measured from the host, while we use libcupti (when found) to measure the device time. Elapsed time is inclusive of the device time plus some overhead.
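To make the relationship between the two timers concrete, here is a small sketch that uses only the V100 figures posted above. It subtracts device time from elapsed time to estimate the host-side overhead per launch (roughly 31 us for both kernels), and it sums the device times, which appears to be what the `time(us): 66,126` header reports. That would also explain `time(us): 0` on the P100, where no device times were collected. The dictionary layout and the function name are illustrative, not part of any PGI tool.

```python
# Figures copied from the V100 PGI_ACC_TIME output in this thread (microseconds).
# Keys are the source line numbers of the two compute regions.
kernels = {
    457: {"launches": 200, "device_us": 36_985, "elapsed_us": 43_155},
    487: {"launches": 200, "device_us": 29_141, "elapsed_us": 35_290},
}


def host_overhead_per_launch(k):
    """Elapsed (host-measured) time minus device time, averaged over launches."""
    return (k["elapsed_us"] - k["device_us"]) / k["launches"]


# Sum of the per-kernel device times; compare with the "time(us):" header line.
total_device_us = sum(k["device_us"] for k in kernels.values())

for line, k in kernels.items():
    print(f"line {line}: about {host_overhead_per_launch(k):.1f} us overhead per launch")
print(f"sum of device times: {total_device_us} us")
```

The overhead is nearly identical for the two kernels, which is consistent with it being a per-launch cost rather than something that scales with the kernel's work.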

libcupti can be found under the “CUDA” directories we ship with the compilers. The exact directory will change depending on which compiler version you’re using, but look under “$PGI/201[8|9]/cuda//lib64”. Set your LD_LIBRARY_PATH to the CUDA lib directory that matches the version you used to compile.
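As a quick way to see which libcupti copies are visible on the machine in question, a minimal sketch along these lines lists every `libcupti.so*` file in the directories named by LD_LIBRARY_PATH, in search order. The helper name `find_cupti` is hypothetical, and the script only reports what is on the path. It assumes the runtime locates libcupti through the normal dynamic loader search and does not show which copy, if any, the runtime actually loads.

```python
import glob
import os


def find_cupti(ld_library_path):
    """List libcupti.so* files in each directory of a colon-separated search path.

    Directories are visited in the order they appear, so the first hit is the
    copy a plain loader search by name would most likely reach first.
    """
    hits = []
    for directory in ld_library_path.split(":"):
        if not directory:
            # Empty entries are skipped here for simplicity.
            continue
        hits.extend(sorted(glob.glob(os.path.join(directory, "libcupti.so*"))))
    return hits


if __name__ == "__main__":
    found = find_cupti(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        for path in found:
            print(path)
    else:
        print("no libcupti.so* found on LD_LIBRARY_PATH")
```

If more than one CUDA version shows up in the listing, the order matters: the directory that matches the CUDA version used to compile should come first.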

If this is a system where you don’t have PGI installed, you can download libcupti from: https://developer.nvidia.com/CUPTI

Hope this helps,
Mat

gongjing · March 19, 2019, 3:36pm

Hi Mat,
Thanks for your response.

“Set your LD_LIBRARY_PATH to the CUDA lib directory that matches the version you used to compile.”
The CUDA lib directory already seems to be set, but I still get the device time on the V100:

$ echo $LD_LIBRARY_PATH
/apps/PGI/2018-1810/linuxpower/2018/mpi/openmpi-2.1.2/lib:/apps/PGI/2018-1810/linuxpower/18.10/lib:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/nvvm/lib64:/usr/local/cuda-10.0/lib64:/usr/lib64/nvidia/xorg
$ ls /usr/local/cuda-10.0/extras/CUPTI/lib64/
libcupti.so  libcupti.so.10.0  libcupti.so.10.0.130

Any idea about this? Thanks. /JG

MatColgrove · March 22, 2019, 4:54pm

I’m not positive. Maybe it’s a mismatch between the CUDA version that was used to build the binary and the CUDA 10.0 libcupti?

-Mat
