Compiler appears to not privatize thread local private data

The intent of this code snippet is to parallelize the j and i loops but not the k loops. The listed 1-D arrays (dimensioned over k) and scalars should be private to each (i,j) iteration. Any idea what is wrong with my construct? Thanks.

!$acc region copyin(dzmx,ns,is,ie,js,je,ng,km,iad,ktop,gama,cp,cappa,rdgas,dm2,pm2,quick_p,c_core,pt,bdt,seq,grg,rcp,rdt) &
!$acc copy(dz2,w) copyout(p3)
!$acc do parallel independent private(c2,p2,pt2,r_p,r_n,rden,dz,dm,wm,dts,pdt,m_bot,m_top,r_bot,r_top,time_left,pe1,pbar,wbar,dt,z_frac,t_left,a1,b1,g2,k2,ke,kt,k0,k1,k3)
do j = js,je ! j_loop
!$acc do parallel independent
do 6000 i=is,ie

do 5000 n=1,ns

dt = seq(n)

do k=ktop,km
dts(k) = -dz(k) / c2(k)
pdt(k) = dts(k)*(p2(k)-pm2(i,j,k))
r_p(k) = wm(k) + pdt(k)
r_n(k) = wm(k) - pdt(k)
enddo

do k=ktop+1,km+1
k2(k) = k-1
m_top(k) = 0.
r_top(k) = 0.
time_left(k) = dt
enddo

do 444 ke=km+1,ktop+1,-1
kt=k2(ke)
do k=kt,ktop,-1
z_frac = time_left(ke)/dts(k)
if ( z_frac <= 1. ) then
if ( ... > 2 ) then
k1 = ke-1
k2(k1) = k
m_top(k1) = m_top(ke) - dm(k1)
r_top(k1) = r_top(ke) - r_n(k1)
time_left(k1) = time_left(ke) + dts(k1)
endif
m_top(ke) = m_top(ke) + z_frac*dm(k)
r_top(ke) = r_top(ke) + z_frac*r_n(k)
exit
else
time_left(ke) = time_left(ke) - dts(k)
m_top(ke) = m_top(ke) + dm(k)
r_top(ke) = r_top(ke) + r_n(k)
endif
enddo
if ( z_frac <= 1. ) cycle
if ( ke == ktop+1 ) exit
do k=ke-1,ktop+1,-1
m_top(k) = m_top(k+1) - dm(k)
r_top(k) = r_top(k+1) - r_n(k)
enddo
exit
444 continue

5000 continue
6000 continue
end do ! j_loop
!$acc end region

The compiler messages are:

PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (nh_core_cpu_mod.F90: 225)
riem_3d:
228, Loop is parallelizable
230, Loop is parallelizable
232, Complex loop carried dependence of ‘dts’ prevents parallelization
Loop carried dependence of ‘dts’ prevents parallelization
Loop carried backward dependence of ‘dts’ prevents vectorization
Loop carried reuse of ‘pdt’ prevents parallelization
Loop carried dependence of ‘r_n’ prevents parallelization
Complex loop carried dependence of ‘r_n’ prevents parallelization
Loop carried backward dependence of ‘r_n’ prevents vectorization
Complex loop carried dependence of ‘k2’ prevents parallelization
Loop carried dependence of ‘k2’ prevents parallelization
Loop carried backward dependence of ‘k2’ prevents vectorization
Loop carried dependence of ‘m_top’ prevents parallelization
Complex loop carried dependence of ‘m_top’ prevents parallelization
Loop carried backward dependence of ‘m_top’ prevents vectorization
Loop carried dependence of ‘r_top’ prevents parallelization
Complex loop carried dependence of ‘r_top’ prevents parallelization
Loop carried backward dependence of ‘r_top’ prevents vectorization
Loop carried dependence of ‘time_left’ prevents parallelization
Complex loop carried dependence of ‘time_left’ prevents parallelization
Loop carried backward dependence of ‘time_left’ prevents vectorization
Loop carried dependence of ‘k2’ prevents vectorization
Loop carried dependence of ‘m_top’ prevents vectorization
Loop carried dependence of ‘r_top’ prevents vectorization
Loop carried dependence of ‘time_left’ prevents vectorization
Loop carried scalar dependence for ‘z_frac’ at line 271
Accelerator kernel generated
228, !$acc do parallel ! blockidx%y
230, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
232, !$acc do seq(256)
Cached references to size [256] block of ‘seq’
236, Loop is parallelizable
243, Loop is parallelizable
250, Complex loop carried dependence of ‘k2’ prevents parallelization
Loop carried reuse of ‘k2’ prevents parallelization
Loop carried dependence of ‘k2’ prevents parallelization
Loop carried backward dependence of ‘k2’ prevents vectorization
Complex loop carried dependence of ‘m_top’ prevents parallelization
Loop carried dependence of ‘m_top’ prevents parallelization
Loop carried backward dependence of ‘m_top’ prevents vectorization
Complex loop carried dependence of ‘r_top’ prevents parallelization
Loop carried dependence of ‘r_top’ prevents parallelization
Loop carried backward dependence of ‘r_top’ prevents vectorization
Loop carried reuse of ‘r_top’ prevents parallelization
Complex loop carried dependence of ‘time_left’ prevents parallelization
Loop carried dependence of ‘time_left’ prevents parallelization
Loop carried backward dependence of ‘time_left’ prevents vectorization
Loop carried reuse of ‘time_left’ prevents parallelization
Loop carried scalar dependence for ‘z_frac’ at line 271
252, Complex loop carried dependence of ‘time_left’ prevents parallelization
Scalar last value needed after loop for ‘z_frac’ at line 262
Scalar last value needed after loop for ‘z_frac’ at line 263
Scalar last value needed after loop for ‘z_frac’ at line 271
Loop carried reuse of ‘time_left’ prevents parallelization
Complex loop carried dependence of ‘m_top’ prevents parallelization
Loop carried dependence of ‘m_top’ prevents parallelization
Loop carried backward dependence of ‘m_top’ prevents vectorization
Complex loop carried dependence of ‘r_top’ prevents parallelization
Loop carried reuse of ‘r_top’ prevents parallelization
Inner sequential loop scheduled on accelerator
253, Accelerator restriction: induction variable live-out from loop: k
266, Accelerator restriction: induction variable live-out from loop: k
267, Accelerator restriction: induction variable live-out from loop: k
268, Accelerator restriction: induction variable live-out from loop: k
270, Accelerator restriction: induction variable live-out from loop: k
273, Loop carried dependence of ‘m_top’ prevents parallelization
Loop carried backward dependence of ‘m_top’ prevents vectorization
Loop carried dependence of ‘r_top’ prevents parallelization
Loop carried backward dependence of ‘r_top’ prevents vectorization
Inner sequential loop scheduled on accelerator

Hi Aoloso,

Try moving the private clause to the “i” loop, since that is the level at which you want the arrays privatized. I’d also add the kernel clause so that the body of the “i” loop becomes the kernel.

For example:

!$acc region copyin(dzmx,ns,is,ie,js,je,ng,km,iad,ktop,gama,cp,cappa,rdgas,dm2,pm2,quick_p,c_core,pt,bdt,seq,grg,rcp,rdt) &
!$acc copy(dz2,w) copyout(p3)
!$acc do parallel 
do j = js,je ! j_loop
!$acc do kernel private(c2,p2,pt2,r_p,r_n,rden,dz,dm,wm,dts,pdt,m_bot,m_top,r_bot,r_top,time_left,pe1,pbar,wbar,dt,z_frac,t_left,a1,b1,g2,k2,ke,kt,k0,k1,k3)
do 6000 i=is,ie

Note that if the arrays are privatized, I don’t see a need for independent here, so I took those clauses out.
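To see the same pattern in isolation, here is a minimal sketch that is not taken from the original code. The module name `private_demo`, the routine `column_sums`, and the scratch array `work` are all made up for illustration. It uses the same directive layout as the snippet above: `parallel` on the j loop, and `kernel private(...)` on the i loop so that each (i,j) iteration should get its own copy of the k-dimensioned scratch array. A compiler that does not recognize these directives treats the `!$acc` lines as comments and runs the loops serially, which makes it easy to compare results.

```fortran
module private_demo
  implicit none
contains
  ! Hypothetical helper, not from the original post: for each (i,j) column,
  ! fill a scratch array over k and reduce it to a single value.
  subroutine column_sums(a, total, ni, nj, nk)
    integer, intent(in) :: ni, nj, nk
    real, intent(in)  :: a(ni, nj, nk)
    real, intent(out) :: total(ni, nj)
    real :: work(nk)   ! scratch column: every (i,j) iteration needs its own copy
    real :: s
    integer :: i, j, k

!$acc region copyin(a) copyout(total)
!$acc do parallel
    do j = 1, nj
!$acc do kernel private(work, s)
      do i = 1, ni
        ! The k loops are meant to stay sequential inside the kernel body.
        do k = 1, nk
          work(k) = 2.0 * a(i, j, k)
        end do
        s = 0.0
        do k = 1, nk
          s = s + work(k)
        end do
        total(i, j) = s
      end do
    end do
!$acc end region
  end subroutine column_sums
end module private_demo
```

Calling `column_sums` with a(i,j,k) = i + j + k on a 4 x 3 x 5 grid should give total(1,1) = 50 and total(4,3) = 100. If the accelerated build gives different values from the serial build, that would suggest `work` is still being shared between iterations.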

- Mat