OpenACC in a Fortran subroutine

Hello, I am trying to use OpenACC to accelerate my finite difference method code,
but I have run into a performance problem. Here is my code:

program FDM
...
! iteration loop....
!$acc data copy(RHS_U,TMP1,K,R,S), copyin(dx,dy)
do
...
!$acc kernels
DO I = 2, Nx-1
DO J = 2, Ny-1
  RHS_U(I,J) = K(I,J) * ( -TMP1(I-2,J) + 16.0_DP * TMP1(I-1,J) - 30.0_DP * TMP1(I,J)   + &
                           16.0_DP * TMP1(I+1,J) - TMP1(I+2,J) ) / ( 12.0_DP * DX**2 ) + &
               K(I,J) * ( -TMP1(I,J-2) + 16.0_DP * TMP1(I,J-1) - 30.0_DP * TMP1(I,J)   + &
                           16.0_DP * TMP1(I,J+1) - TMP1(I,J+2) ) / ( 12.0_DP * DY**2 )   &
             - R(I,J) * TMP1(I,J) + S(I,J)
END DO
END DO
!$acc end kernels
...
end do
! end iteration loop....
...
end program

With the code above, the total computation time is about 10 s. I then moved the RHS_U calculation into a subroutine called CD4:

 SUBROUTINE CD4( Nx, Ny, K , R , S , dx, dy, RHS_U , TMP1 )
  
  IMPLICIT NONE
  INTEGER                       :: I , J
  INTEGER       , INTENT(INOUT) :: Nx 
  INTEGER       , INTENT(INOUT) :: Ny
  
  REAL(KIND=DP) , INTENT(INOUT) :: K(:,:)
  REAL(KIND=DP) , INTENT(INOUT) :: R(:,:)
  REAL(KIND=DP) , INTENT(INOUT) :: S(:,:)
  
  REAL(KIND=DP) , INTENT(INOUT) :: RHS_U(:,:)
  REAL(KIND=DP) , INTENT(INOUT) :: TMP1(:,:)
  REAL(KIND=DP) , INTENT(INOUT) :: DX , DY  
  
  !$acc kernels present(RHS_U,K,R,S,TMP1,DX,DY)
  DO I = 2 , Nx-1
  DO J = 2 , Ny-1
    RHS_U(I,J) = K(I,J) * ( -TMP1(I-2,J) + 16.0_DP * TMP1(I-1,J) - 30.0_DP * TMP1(I,J)   + &
                             16.0_DP * TMP1(I+1,J) - TMP1(I+2,J) ) / ( 12.0_DP * DX**2 ) + &
                 K(I,J) * ( -TMP1(I,J-2) + 16.0_DP * TMP1(I,J-1) - 30.0_DP * TMP1(I,J)   + &
                             16.0_DP * TMP1(I,J+1) - TMP1(I,J+2) ) / ( 12.0_DP * DY**2 )   &
               - R(I,J) * TMP1(I,J) + S(I,J)
  END DO
  END DO
  !$acc end kernels
        
END SUBROUTINE

The original code becomes

program FDM
...
! iteration loop....
!$acc data copy(RHS_U,TMP1,K,R,S), copyin(dx,dy)
do
...
CALL CD4( Nx, Ny, K , R , S , dx, dy, RHS_U , TMP1 )
...
end do
! end iteration loop....
...
end program

However, the total computation time is now about 18 s, so performance has gotten worse, and I do not know why. Any ideas? I am using PGI Accelerator Fortran Workstation v13.8.

Hi SCCS,

I’ve forgotten whether we do this in v13.8, but we do use “INTENT(IN)” to determine whether a read-only array can be placed in texture memory. Declaring every argument “INTENT(INOUT)” may inhibit this. What happens if you change all but “RHS_U” to “INTENT(IN)”?

If that doesn’t help, can you post the compiler feedback messages (-Minfo=accel) for each case? Also, please post the profile information, which you can get by setting PGI_ACC_TIME=1 in your environment.
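
For reference, the INTENT change could look roughly like the sketch below. It is only an illustration and not the original program: the module name fdm_kernels, the DP kind definition, the driver program, and the test values are all assumptions. The loop bounds are also narrowed to 3..N-2, because assumed-shape dummies such as TMP1(:,:) start at index 1 and the stencil reaches two points in each direction; the original bounds may be fine if the real arrays are set up differently.

```fortran
! Sketch only. Hypothetical names: fdm_kernels, cd4_demo, N, and all test values.
MODULE fdm_kernels
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)   ! assumed definition of DP
CONTAINS
  SUBROUTINE CD4( Nx, Ny, K, R, S, dx, dy, RHS_U, TMP1 )
    ! Everything that is only read is INTENT(IN); RHS_U is the only output.
    INTEGER       , INTENT(IN)    :: Nx, Ny
    REAL(KIND=DP) , INTENT(IN)    :: K(:,:), R(:,:), S(:,:)
    REAL(KIND=DP) , INTENT(IN)    :: TMP1(:,:)
    REAL(KIND=DP) , INTENT(IN)    :: dx, dy
    REAL(KIND=DP) , INTENT(INOUT) :: RHS_U(:,:)
    INTEGER :: I, J

    ! Arrays are expected to be on the device already (enclosing data region).
    !$acc kernels present(RHS_U,K,R,S,TMP1)
    DO I = 3, Nx-2        ! assumed bounds: keep I-2 and I+2 inside 1..Nx
    DO J = 3, Ny-2
      RHS_U(I,J) = K(I,J) * ( -TMP1(I-2,J) + 16.0_DP * TMP1(I-1,J) - 30.0_DP * TMP1(I,J) + &
                               16.0_DP * TMP1(I+1,J) - TMP1(I+2,J) ) / ( 12.0_DP * dx**2 ) + &
                   K(I,J) * ( -TMP1(I,J-2) + 16.0_DP * TMP1(I,J-1) - 30.0_DP * TMP1(I,J) + &
                               16.0_DP * TMP1(I,J+1) - TMP1(I,J+2) ) / ( 12.0_DP * dy**2 )   &
                 - R(I,J) * TMP1(I,J) + S(I,J)
    END DO
    END DO
    !$acc end kernels
  END SUBROUTINE CD4
END MODULE fdm_kernels

PROGRAM cd4_demo
  USE fdm_kernels
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 16             ! hypothetical grid size
  REAL(KIND=DP) :: K(N,N), R(N,N), S(N,N), TMP1(N,N), RHS_U(N,N)
  REAL(KIND=DP) :: dx, dy

  dx = 0.1_DP
  dy = 0.1_DP
  K = 1.0_DP
  R = 1.0_DP
  S = 2.0_DP
  TMP1  = 1.0_DP     ! constant field: the stencil terms cancel in the interior
  RHS_U = 0.0_DP

  ! Read-only arrays only need copyin; RHS_U is copied both ways.
  !$acc data copy(RHS_U), copyin(TMP1,K,R,S)
  CALL CD4( N, N, K, R, S, dx, dy, RHS_U, TMP1 )
  !$acc end data

  ! Interior points reduce to S - R*TMP1 for this constant test field.
  PRINT *, 'RHS_U(5,5) = ', RHS_U(5,5)
END PROGRAM cd4_demo
```

To collect the information requested above, something like "pgfortran -acc -Minfo=accel cd4_demo.f90" (the file name is made up here) should print the accelerator feedback at compile time, and running the executable with PGI_ACC_TIME=1 set in the environment should print the timing profile at exit. Whether the INTENT change actually recovers the lost time in v13.8 is not something this sketch can confirm.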

  • Mat