Handling global variables inside OpenMP offload kernels

I am working on a DSL framework that generates OpenMP offload versions of code.

I am currently working on Fortran 90 code generation and have some doubts about handling constants and local arrays inside OpenMP offload kernels.

Currently I am doing the following:

MODULE OPS_CONSTANTS
integer :: imax, jmax
real(8) :: pi = 2.0_8 * ASIN(1.0_8)

!$OMP DECLARE TARGET(imax)
!$OMP DECLARE TARGET(jmax)
!$OMP DECLARE TARGET(pi)

END MODULE OPS_CONSTANTS

Then in main program

PROGRAM laplace
  use OPS_CONSTANTS
  use APPLY_STENCIL_KERNEL_MODULE
  imax = 4094
  jmax = 4094

! Allocate array A and Anew on Device

!$OMP TARGET ENTER DATA MAP(TO:imax)
!$OMP TARGET UPDATE TO(imax)
!$OMP TARGET ENTER DATA MAP(TO:jmax)
!$OMP TARGET UPDATE TO(jmax)
!$OMP TARGET ENTER DATA MAP(TO:pi)
!$OMP TARGET UPDATE TO(pi)

CALL apply_stencil_kernel_host(A,Anew,error)

END PROGRAM laplace

and in the apply stencil kernel module:

MODULE APPLY_STENCIL_KERNEL_MODULE

    USE OPS_CONSTANTS
    USE, INTRINSIC :: ISO_C_BINDING

    IMPLICIT NONE

    INTEGER(KIND=4) :: xdim1_apply_stencil_kernel
!$OMP DECLARE TARGET(xdim1_apply_stencil_kernel)

CONTAINS

SUBROUTINE apply_stencil_kernel_host(A, Anew, error)
    REAL(KIND=8), DIMENSION(:,:), INTENT(IN) :: A
    REAL(KIND=8), DIMENSION(:,:) :: Anew
    REAL(KIND=8) :: error
    INTEGER(KIND=4) :: n_x, n_y
    INTEGER(KIND=4), DIMENSION(2)             :: idx_local

    xdim1_apply_stencil_kernel = imax
!$OMP TARGET ENTER DATA MAP(TO:xdim1_apply_stencil_kernel)
!$OMP TARGET UPDATE TO(xdim1_apply_stencil_kernel)

!$OMP TARGET DATA MAP(TO:idx_local(1:2))
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO PRIVATE(n_x,n_y,idx_local) REDUCTION(MAX:error)
     DO n_y = 1, jmax
        DO n_x = 1, imax
            idx_local = [n_x,n_y]
           Anew(n_x,n_y) = 0.25_8 * ( A(n_x+1,n_y) + A(n_x-1,n_y) &
                                  & + A(n_x,n_y-1) + A(n_x,n_y+1) )
! Some code using  xdim1_apply_stencil_kernel and idx_local, ONLY READ

        END DO
   END DO
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
!$OMP END TARGET DATA

END SUBROUTINE apply_stencil_kernel_host

END MODULE APPLY_STENCIL_KERNEL_MODULE

Is this the correct approach?

Using “declare target” on the module scalars isn’t really necessary; it’s only needed if the code directly accesses the module variables from a device subroutine. Though it shouldn’t hurt.

The “ENTER DATA MAP” directives are redundant since the variable is already declared on the device.
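
For example, the constants setup could be reduced to something like this (a minimal sketch of the same pattern, not the exact OPS code):

```fortran
MODULE OPS_CONSTANTS
    INTEGER :: imax, jmax
    REAL(8) :: pi = 2.0_8 * ASIN(1.0_8)
! "declare target" gives each variable a device copy at program start
!$OMP DECLARE TARGET(imax, jmax, pi)
END MODULE OPS_CONSTANTS

PROGRAM laplace
    USE OPS_CONSTANTS
    imax = 4094
    jmax = 4094
! The variables already exist on the device, so no ENTER DATA is needed;
! a single "target update" refreshes the device copies after host assignment.
!$OMP TARGET UPDATE TO(imax, jmax, pi)
END PROGRAM laplace
```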

I’d recommend collapsing the parallel loops, since currently only the “n_y” loop is being parallelized. Granted, there could be something in the omitted code that creates a dependency in “n_x”, but that’s unlikely.

!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(2) PRIVATE(n_x,n_y) REDUCTION(MAX:error)    
     DO n_y = 1, jmax
        DO n_x = 1, imax

You should also try using “LOOP” instead since it will often give better performance than “DISTRIBUTE PARALLEL DO”. Not always, but an easy experiment to try:

!$OMP TARGET TEAMS LOOP COLLAPSE(2) PRIVATE(n_x,n_y) REDUCTION(MAX:error)    
     DO n_y = 1, jmax
        DO n_x = 1, imax

Hi Mat,

In this case I am accessing the module variables directly in some other routines. I will drop the “ENTER DATA MAP” directives as mentioned.
I will also change to “LOOP” instead of “DISTRIBUTE PARALLEL DO” and add COLLAPSE as mentioned.
Thanks for the help.

–Ashutosh S. Londhe

Hi @MatColgrove

I tried LOOP instead of DISTRIBUTE PARALLEL DO, but with COLLAPSE and REDUCTION together
I am getting the following error:
“Fatal error: Could not launch CUDA kernel on device 0, error 1”

Can you provide a full reproducing example?

I tried with what you posted but wasn’t able to reproduce the error. I assume there’s something in the omitted code that’s causing the problem.

Hi @MatColgrove

!$OMP TARGET TEAMS LOOP COLLAPSE(2) PRIVATE(n_x,n_y) REDUCTION(MAX:opsDat3Local)
    DO n_y = 1, end_indx(2)-start_indx(2)+1
        DO n_x = 1, end_indx(1)-start_indx(1)+1

                CALL apply_stencil_kernel( &
                opsDat1Local(dat1_base + ((n_x-1)*1*1) + ((n_y-1)*xdim1_apply_stencil_kernel*1*1)), &
                opsDat2Local(dat2_base + ((n_x-1)*1*1) + ((n_y-1)*xdim2_apply_stencil_kernel*1*1)), &
                opsDat3Local(dat3_base) &
               )

        END DO
    END DO
!$OMP END TARGET TEAMS LOOP

This is the piece of code that is giving the problem.
If I drop COLLAPSE here, it runs fine.

This is part of the OPS library, so you need to build the library and then run this.
I will share the instructions for that in the next post.

Hi @MatColgrove

Following are the instructions to build OPS library and build the Laplace application which i was talking about
For installing the OPS library:

Create a source env file. Below is a sample file for the PGI compiler:

export OPS_COMPILER=pgi
export OPS_INSTALL_PATH=<set the path till OPS/ops >

module purge

# Set correct NV_ARCH
export NV_ARCH=Volta
echo "GPU architecture" $NV_ARCH

#PGI MPI and Compilers
module load nvhpc/23.1

export CUDA_INSTALL_PATH=<CUDA HOME DIR>
export CUDA_MATH_LIBS=$CUDA_INSTALL_PATH/lib64
export LD_LIBRARY_PATH=$CUDA_INSTALL_PATH/lib64:$LD_LIBRARY_PATH

export MPI_INSTALL_PATH=<MPI HOME>
export PATH=$MPI_INSTALL_PATH/bin:$PATH
export LD_LIBRARY_PATH=$MPI_INSTALL_PATH/lib:$LD_LIBRARY_PATH

export OP_AUTO_SOA=1

export MPICPP=mpic++
export MPICH_CXX=pgc++
export MPICH_CC=pgcc
export MPICH_F90=pgfortran
export MPIF90_F90=pgfortran
export MPICH_FC=pgfortran


unset HDF5_INSTALL_PATH
export HDF5_INSTALL_PATH=<HDF5 HOME>
export PATH=$HDF5_INSTALL_PATH/bin:$PATH
export LD_LIBRARY_PATH=$HDF5_INSTALL_PATH/lib:$LD_LIBRARY_PATH

# Require python3.9 or above
module load python/3.9.7
source $OPS_INSTALL_PATH/../ops_translator/ops_venv/bin/activate
  1. Do a git clone:
git clone https://github.com/OP-DSL/OPS.git --branch feature/F90_offload
  2. Make the necessary changes to the source_env file and source it.
  3. Go to the OPS/ops_translator directory and run “setup_venv.sh”. This will install the necessary Python packages.
  4. Go to OPS/ops/fortran and run make.
  5. Go to OPS/apps/fortran/laplace2dtutorial/step7 and run “make laplace2d_ompoffload”.

This build will run as-is, since it does not have COLLAPSE in the apply_stencil function.
To reproduce the error, open the file openmp_offload/apply_stencil_kernel_ompoffload_kernel.F90 (generated by the OPS code generator), add the COLLAPSE clause, and recompile laplace2d_ompoffload.

Thanks, I was able to reproduce the error. It looks to be an issue with the array reduction in combination with the subroutine call. I see two workarounds.

Add the flag “-Minline” so “apply_stencil_kernel” is inlined instead of called.

Though I think a better solution is to use a scalar reduction instead. While opsDat3Local only uses a single element for the reduction and is only a one-element array, it’s still an array, and array reductions can incur additional overhead.

Something like:

    INTEGER(KIND=4) :: n_x, n_y
    REAL(KIND=8) :: error

!$OMP TARGET TEAMS LOOP COLLAPSE(2) PRIVATE(n_x,n_y) REDUCTION(MAX:error)
    DO n_y = 1, end_indx(2)-start_indx(2)+1
        DO n_x = 1, end_indx(1)-start_indx(1)+1
                call apply_stencil_kernel( &
                opsDat1Local(dat1_base + ((n_x-1)*1*1) + ((n_y-1)*xdim1_apply_stencil_kernel*1*1)), &
                opsDat2Local(dat2_base + ((n_x-1)*1*1) + ((n_y-1)*xdim2_apply_stencil_kernel*1*1)), &
                error &
               )
        END DO
    END DO
    opsDat3Local(dat3_base) = error

END SUBROUTINE

Note that for performance reasons, I’d recommend adding “-Minline” as well as using the scalar reduction.

Here’s the time I’m seeing on an A100 with just the scalar reduction:

% laplace2d_ompoffload
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0,        0.2500000
   10,        0.0215625
   20,        0.0114887
   30,        0.0078256
   40,        0.0058566
   50,        0.0047514
   60,        0.0039452
   70,        0.0034119
   80,        0.0029797
   90,        0.0026580
  100,        0.0024214
Total error is within        0.11622E-04 % of the expected error
This run is considered PASSED
 completed in        1.1174500 seconds

Recompiling with -Minline significantly improves performance:

% laplace2d_ompoffload
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0,        0.2500000
   10,        0.0215625
   20,        0.0114887
   30,        0.0078256
   40,        0.0058566
   50,        0.0047514
   60,        0.0039452
   70,        0.0034119
   80,        0.0029797
   90,        0.0026580
  100,        0.0024214
Total error is within        0.11622E-04 % of the expected error
This run is considered PASSED
 completed in        0.0472980 seconds

While not by a large margin, “loop” with -Minline is faster than using “distribute parallel do”:

% laplace2d_ompoffload
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0,        0.2500000
   10,        0.0215625
   20,        0.0114887
   30,        0.0078256
   40,        0.0058566
   50,        0.0047514
   60,        0.0039452
   70,        0.0034119
   80,        0.0029797
   90,        0.0026580
  100,        0.0024214
Total error is within        0.11622E-04 % of the expected error
This run is considered PASSED
 completed in        0.0520730 seconds

Hi @MatColgrove, thanks for the suggestions. I have now implemented them in the OPS code generation framework.

This not only improves performance but also solved a problem with other applications that were not producing correct results.

I have one small doubt.

I have a few constants assigned values in the module file that declares the globals which need to be accessed inside OpenMP offload kernels.

For constants assigned with values, I am getting the following warning:

NVFORTRAN-W-0155-Constant or Parameter used in data clause - ncofmx

What I have done in this case is put the following in the constants module file:

MODULE CONSTANTS
    integer(kind=4), parameter :: ncofmx=7
    !$OMP DECLARE TARGET(ncofmx)
END MODULE

and then, somewhere in the other .F90 files but before executing any OpenMP offload kernel:
!$OMP TARGET UPDATE TO(ncofmx)

Was this TARGET UPDATE not required for variables initialized at declaration?
And is there any problem if I still put the TARGET UPDATE?

Thanks
–Ashutosh S Londhe

It is required for variables, but this is a parameter. Since the value of a parameter is constant, the compiler replaces the “ncofmx” symbol with the value “7” in all uses, so the “declare target” isn’t needed.

Hi @MatColgrove
You mean “TARGET UPDATE” isn’t needed, right?
Is “DECLARE TARGET” still needed in this case? I was getting an error if I dropped DECLARE TARGET: “acc declare target is required for constants inside offload kernel”.

Neither should be needed.

A parameter is not a variable. It has no storage, so it can’t be created or updated. Again, the compiler replaces instances of the symbol name with the literal value.
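
For example, something like the following should compile and run without any “declare target” or “target update” for the parameter (a minimal standalone sketch, not the OPS code):

```fortran
MODULE CONSTANTS
    INTEGER(KIND=4), PARAMETER :: ncofmx = 7
END MODULE CONSTANTS

PROGRAM use_param
    USE CONSTANTS
    IMPLICIT NONE
    INTEGER :: i, total
    total = 0
! ncofmx is folded to the literal 7 at compile time, so no mapping,
! declare target, or target update is needed for it.
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO REDUCTION(+:total)
    DO i = 1, ncofmx
        total = total + i
    END DO
    PRINT *, total   ! prints 28 (= 1+2+...+7)
END PROGRAM use_param
```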

Hi @MatColgrove,

Thanks for the clarification. I forgot to add the word “parameter” in the declaration; that’s why the compilation error occurred. It is corrected now.

Thanks
Ashutosh S. Londhe