Hi,
I’m trying to port a large code from the CPU to the GPU using OpenACC.
My idea was to replace the OpenMP directives with OpenACC kernels directives. According to the PGI compiler output, however, the code below seems difficult to parallelize, although it offers a lot of parallelism. I wanted to parallelize the whole loop nest step by step, starting from the inside.
I hoped to improve the performance by adding a private clause,
but I don’t understand where to place it.
The two innermost loops, k and l, can be made parallel. Each worker (thread) accesses one element of the array dfc_.
Before these two loops, however, some data needs to be copied into the dfc_ array.
So my question is: where should the private clause go? I placed it on the j loop (directly before it),
but this does not improve the performance. Is this placement correct?
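To make the placement in question concrete, here is a minimal sketch, not the actual code. It drops the coloring and blocking loops and all neighbor terms, uses hypothetical fixed sizes, and swaps `kernels` for an explicit `parallel loop`; all of these are assumptions for illustration only. It puts `private(dfc_)` on the `j` loop so that each `j` iteration gets its own copy of `dfc_`. The `k` loop is kept sequential in this sketch, because `dfc_(l)` carries a value from one `k` iteration into the next.

```fortran
program private_sketch
  implicit none
  ! Hypothetical sizes, chosen only for illustration.
  integer, parameter :: nv = 4, nz = 8, ny = 16
  real :: df(nv, 0:nz, ny), f(nv, nz, ny)
  real :: dfc_(nv)
  integer :: j, k, l

  df = 1.0
  f  = 2.0

  ! private(dfc_) on the j loop: every j iteration (gang) has its own dfc_.
  !$acc parallel loop gang private(dfc_) copy(df) copyin(f)
  do j = 1, ny
     ! copy data into the private scratch array
     do l = 1, nv
        dfc_(l) = df(l, 0, j)
     enddo
     ! dfc_(l) written at step k is read at step k+1, so k stays sequential.
     !$acc loop seq
     do k = 1, nz
        ! Each vector lane handles one element l of dfc_.
        !$acc loop vector
        do l = 1, nv
           df(l, k-1, j) = dfc_(l)
           dfc_(l) = f(l, k, j) + df(l, k, j)
        enddo
     enddo
  enddo

  print *, df(1, 0, 1), df(1, nz-1, 1), df(1, nz, 1)
end program private_sketch
```

Without OpenACC enabled, the directives are plain comments and the program runs serially, which makes it easy to compare results. With the PGI compiler, building with `-acc -Minfo=accel` should report how each loop in the sketch was scheduled.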
Thank you for your help
! old OpenMP code
!!$OMP PARALLEL &
!!$OMP PRIVATE(i,j,k,l,ii,jj,ssi,j0,iicolor,icolor)
!$acc kernels present(df,f,A)
do iicolor = 1,ncolor*2
  icolor = iicolor_no(iicolor)
  do jj = 1 ,ny,nblock
    !!$OMP DO &
    !!$OMP SCHEDULE(DYNAMIC,1)
    do ii = 1 + mod( jj/nblock + (icolor-1) ,2 ) * nblock ,nx,nblock*ncolor
      if(iicolor <= ncolor)then
        do ssi = 1,ssi_iter
          do i = ii,min(ii+nblock-1,nx)
            !$acc loop independent private(dfc_)
            do j = jj,min(jj+nblock-1,ny)
              ! copy data
              do l = 1,nv
                dfc_(l) = df(l,0,j,i)
              enddo
              do k = 1,nz
                do l = 1,nv
                  df(l,k-1,j,i) = dfc_(l)
                  dfc_(l) = f(l,k,j,i)
                  dfc_(l) = dfc_(l) + A(l,1,j,i) * df(l,k,j,i-1)
                  dfc_(l) = dfc_(l) - A(l,2,j,i) * df(l,k,j,i+1)
                  dfc_(l) = dfc_(l) - A(l,3,j,i) * df(l,k,j,i-2)
                  dfc_(l) = dfc_(l) + A(l,4,j,i) * df(l,k,j,i+2)
                  dfc_(l) = dfc_(l) + A(l,5,j,i) * df(l,k,j-1,i)
                  dfc_(l) = dfc_(l) - A(l,6,j,i) * df(l,k,j+1,i)
                  dfc_(l) = dfc_(l) - A(l,7,j,i) * df(l,k,j-2,i)
                  dfc_(l) = dfc_(l) + A(l,8,j,i) * df(l,k,j+2,i)
                enddo