Hi,
I am trying to use PGI_ACC_TIME to profile an OpenACC code on a V100. The kernel includes two matrix-matrix multiplications, which are written as
!$ACC DATA PRESENT(w,u,gxyz,ur,us,ut,wk,dxm1,dxtm1)
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR PRIVATE(wr,ws,wt)
!DIR NOBLOCKING
do e = 1,nelt
  do k = 1,nz1
    do j = 1,ny1
      do i = 1,nx1
        wr = 0
        ws = 0
        wt = 0
        !$ACC LOOP SEQ
        do l = 1,nx1
...
!$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR
do e = 1,nelt
  do k = 1,nz1
    do j = 1,ny1
      do i = 1,nx1
        w(i,j,k,e) = 0.0
        !$ACC LOOP SEQ
...
On the V100 GPU, the output is
...
ax_acc NVIDIA devicenum=0
  time(us): 66,126
  421: data region reached 400 times
  457: compute region reached 200 times
    457: kernel launched 200 times
      grid: [8192] block: [32x4]
      device time(us): total=36,985 max=190 min=177 avg=184
      elapsed time(us): total=43,155 max=222 min=208 avg=215
  487: compute region reached 200 times
    487: kernel launched 200 times
      grid: [8192] block: [32x4]
      device time(us): total=29,141 max=148 min=138 avg=145
      elapsed time(us): total=35,290 max=186 min=168 avg=176
On the P100, the output is
ax_acc NVIDIA devicenum=0
  time(us): 0
  421: data region reached 400 times
  457: compute region reached 200 times
    457: kernel launched 200 times
      grid: [8192] block: [32x4]
      elapsed time(us): total=96,545 max=515 min=469 avg=482
  487: compute region reached 200 times
    487: kernel launched 200 times
      grid: [8192] block: [32x4]
      elapsed time(us): total=79,879 max=427 min=385 avg=399
All variables are “present” on the device, so why are device times reported on the V100 but not on the P100 GPU? How can we avoid it? The V100 lines in question are:
device time(us): total=36,985 max=190 min=177 avg=184
device time(us): total=29,141 max=148 min=138 avg=145
Thanks. /JG