Placement of private clause

Hi,

I’m trying to port a large code from the CPU to the GPU using OpenACC.
My idea was to replace the OpenMP directives with OpenACC kernels directives. According to the PGI compiler output, however, the code below seems difficult to parallelize, although it offers a lot of parallelism. I wanted to parallelize the whole loop nest, starting from the inside.

I wanted to improve the performance by placing the private clause.
But I don’t understand where to place it.
The two innermost loops, k and l, can be made parallel. Each worker (thread) accesses one element of the array dfc_.
But before the two innermost loops, some data needs to be copied into the dfc_ array.
Now my question is where to place the private clause? I placed it before the j loop.
But this does not improve the performance. Is this placement correct?

Thank you for your help

 ! old OpenMP code
 !!$OMP PARALLEL &
 !!$OMP PRIVATE(i,j,k,l,ii,jj,ssi,j0,iicolor,icolor)
 
 !$acc kernels present(df,f,A)
  do iicolor = 1,ncolor*2
     icolor  = iicolor_no(iicolor)
     do jj = 1 ,ny,nblock
     !!$OMP DO &
     !!$OMP SCHEDULE(DYNAMIC,1)
     do ii = 1 + mod( jj/nblock + (icolor-1) ,2 ) * nblock ,nx,nblock*ncolor
     if(iicolor <= ncolor)then
        do ssi = 1,ssi_iter
             do i = ii,min(ii+nblock-1,nx)
                 !$acc loop independent private(dfc_)
                  do j = jj,min(jj+nblock-1,ny)
                   ! copy data
                   do l = 1,nv
                      dfc_(l) = df(l,0,j,i)
                   enddo
                   do k = 1,nz
                      do l = 1,nv
                         df(l,k-1,j,i) = dfc_(l)
                         dfc_(l) = f(l,k,j,i)
                         dfc_(l) = dfc_(l) + A(l,1,j,i) * df(l,k,j,i-1)
                         dfc_(l) = dfc_(l) - A(l,2,j,i) * df(l,k,j,i+1)
                         dfc_(l) = dfc_(l) - A(l,3,j,i) * df(l,k,j,i-2)
                         dfc_(l) = dfc_(l) + A(l,4,j,i) * df(l,k,j,i+2)
                         dfc_(l) = dfc_(l) + A(l,5,j,i) * df(l,k,j-1,i)
                         dfc_(l) = dfc_(l) - A(l,6,j,i) * df(l,k,j+1,i)
                         dfc_(l) = dfc_(l) - A(l,7,j,i) * df(l,k,j-2,i)
                         dfc_(l) = dfc_(l) + A(l,8,j,i) * df(l,k,j+2,i)
                      enddo

Hi Peter85,

> I wanted to improve the performance by placing the private clause.

Privatization isn’t about performance, but rather about creating independent copies of a scratch array or scalar which would otherwise prevent parallelization of a loop. This may indeed help performance, given that a loop can now be parallelized, but not necessarily. Rather, you’ll want to look at which loops are best to parallelize and then place the private clause at the innermost of those loops.
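
To illustrate what privatization does, here is a minimal, self-contained sketch. It is not taken from your code: the program name, the arrays src and dst, the scratch array tmp (a stand-in for dfc_) and the sizes are all assumptions made up for the example. The j loop is the parallel loop, so the private clause sits on it, and the short l loops run sequentially inside each iteration.

```fortran
program private_demo
  implicit none
  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: nv = 4, ny = 1000
  real :: src(nv, ny), dst(nv, ny)
  real :: tmp(nv)            ! scratch array, stand-in for dfc_
  integer :: j, l

  call random_number(src)

  ! private(tmp) gives every j iteration its own copy of tmp, so
  ! concurrent iterations cannot overwrite each other's scratch values.
  !$acc parallel loop gang vector private(tmp) copyin(src) copyout(dst)
  do j = 1, ny
     !$acc loop seq
     do l = 1, nv
        tmp(l) = 2.0 * src(l, j)
     end do
     !$acc loop seq
     do l = 1, nv
        dst(l, j) = tmp(l) + 1.0
     end do
  end do

  ! Self-check against the same expression evaluated on whole arrays
  if (any(abs(dst - (2.0 * src + 1.0)) > 1.0e-5)) error stop "mismatch"
  print *, "ok"
end program private_demo
```

Without an OpenACC-enabled compiler the directives are plain comments, so the program also builds and runs serially. Whether the private copy improves run time depends on the loop structure, not on the clause itself.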

> Now my question is where to place the private clause?

There’s not enough information here to give you a good answer. Placement will largely depend on the loop trip counts and on which loop index is used to access the stride-1 dimension (Fortran is column-major, so that is the first dimension).

My first thought when looking at the code (and I could be wrong) is whether the blocking is needed at all. I can understand why you’d want it for OpenMP, but for massively parallel devices like GPUs, it can be a hindrance.

Also, it looks to me that you’re using a trip count of “ncolor*2” in case the blocks aren’t evenly divided. Since you’re only doing computation when iicolor <= ncolor, you’ll be launching twice as many gangs (CUDA blocks) as needed, since half won’t be doing any work. Not a big deal, but there may be some extra performance loss due to this. Can this be removed as well?

Assuming blocking can be removed, each loop (except k and l) has no dependencies, and the loop trip counts are sufficiently large, I’d probably start by collapsing the i and j loops into a single vector loop. You can then try using a worker loop around ssi, or even collapsing ssi, i, and j.

!$acc kernels present(df,f,A)
!$acc loop gang
  do iicolor = 1,ncolor
        icolor  = iicolor_no(iicolor)
!TRY 1:  !$acc loop seq
!TRY 2:  !$acc loop worker
!TRY 3:  !$acc loop collapse(3) private(dfc_)
        do ssi = 1,ssi_iter
!TRY 1,2  !$acc loop collapse(2) private(dfc_)
             do i = 1,nx
                  do j = 1,ny
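
For reference, the "TRY 1" variant can be written out as a complete toy program. This is only a sketch under the assumptions stated above (no blocking, no dependencies between i and j iterations). The array a, the scratch array tmp, the update 0.5*x + 1 and all sizes are invented for the example and do not reproduce the stencil in your code.

```fortran
program collapse_demo
  implicit none
  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: nv = 4, ny = 32, nx = 32
  integer, parameter :: ncolor = 2, ssi_iter = 3
  real :: a(nv, ny, nx, ncolor), ref(nv, ny, nx, ncolor)
  real :: tmp(nv)            ! scratch array, stand-in for dfc_
  integer :: c, s, i, j, l

  call random_number(a)

  ! Serial reference: apply x -> 0.5*x + 1 ssi_iter times
  ref = a
  do s = 1, ssi_iter
     ref = 0.5 * ref + 1.0
  end do

  !$acc parallel loop gang copy(a)
  do c = 1, ncolor
     ! TRY 1: keep the ssi loop sequential
     !$acc loop seq
     do s = 1, ssi_iter
        ! i and j are collapsed into one vector loop; tmp is private
        ! to each (i,j) iteration
        !$acc loop vector collapse(2) private(tmp)
        do i = 1, nx
           do j = 1, ny
              !$acc loop seq
              do l = 1, nv
                 tmp(l) = a(l, j, i, c)
              end do
              !$acc loop seq
              do l = 1, nv
                 a(l, j, i, c) = 0.5 * tmp(l) + 1.0
              end do
           end do
        end do
     end do
  end do

  if (any(abs(a - ref) > 1.0e-5)) error stop "mismatch"
  print *, "ok"
end program collapse_demo
```

The private clause is on the collapsed loop because that is the innermost loop being parallelized. In the real code the reads of df at i-1, i+1, j-1 and so on would need to be checked for dependencies before using this schedule.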

Also, the striding of “A” and “df” is not ideal. If you can, try reorganizing the data layout:

A(l,1,j,i) => A(i,j,l,1)
df(l,k,j,i-1) => df(i-1,j,l,k)
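
A one-time reorder along the lines of the mapping above could look like the sketch below. The names a_old and a_new, the extent nc of the coefficient dimension and all sizes are assumptions for illustration. In a real port you would more likely allocate and fill the arrays in the new layout from the start rather than copy them.

```fortran
program layout_demo
  implicit none
  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: nv = 3, nc = 8, ny = 16, nx = 16
  real :: a_old(nv, nc, ny, nx)   ! original layout:    A(l,m,j,i)
  real :: a_new(nx, ny, nv, nc)   ! reorganized layout: A(i,j,l,m)
  integer :: i, j, l, m

  call random_number(a_old)

  ! Fortran is column-major, so the first index is contiguous in
  ! memory. With i innermost, the writes to a_new are stride-1.
  do m = 1, nc
     do l = 1, nv
        do j = 1, ny
           do i = 1, nx
              a_new(i, j, l, m) = a_old(l, m, j, i)
           end do
        end do
     end do
  end do

  ! Spot check one (l,m) slice: a_old(2,5,:,:) is indexed (j,i),
  ! so its transpose is indexed (i,j) like a_new(:,:,2,5).
  if (any(a_new(:, :, 2, 5) /= transpose(a_old(2, 5, :, :)))) then
     error stop "mismatch"
  end if
  print *, "ok"
end program layout_demo
```

With the new layout, neighbouring values of i sit next to each other in memory, which is what lets adjacent vector lanes read adjacent elements when i is the index mapped to the vector loop.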

Hope this helps,
Mat

Thank you for your input! I will try it!