Hello,
I am currently working on re-engineering the SENGA+ application using the OPS-DSL framework. OPS-DSL is a collection of software libraries and code translation tools that enables automatic parallelization of multi-block structured mesh algorithms on multi-core and many-core architectures.
In OPS-DSL, we need to rewrite each DO loop as an ops_par_loop call plus an elemental kernel.
! Sample loop
DO kc = 1, nzsize
  DO jc = 1, nysize
    DO ic = 1, nxsize
      fdiffa = functn(ic+1,jc,kc) - functn(ic-1,jc,kc)
      fdiffb = functn(ic+2,jc,kc) - functn(ic-2,jc,kc)
      fdiffc = functn(ic+3,jc,kc) - functn(ic-3,jc,kc)
      fdiffd = functn(ic+4,jc,kc) - functn(ic-4,jc,kc)
      fdiffe = functn(ic+5,jc,kc) - functn(ic-5,jc,kc)
      fderiv(ic,jc,kc) = acoffx*fdiffa + bcoffx*fdiffb &
                       + ccoffx*fdiffc + dcoffx*fdiffd &
                       + ecoffx*fdiffe
    END DO
  END DO
END DO
! Elemental kernel
SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
  IMPLICIT NONE
  REAL(kind=8), dimension(1), intent(in) :: functn
  REAL(kind=8), dimension(1), intent(out) :: fderiv
  REAL(kind=8) :: fdiffa, fdiffb, fdiffc, fdiffd, fdiffe

  fdiffa = functn(OPS_ACC1(1,0,0)) - functn(OPS_ACC1(-1,0,0))
  fdiffb = functn(OPS_ACC1(2,0,0)) - functn(OPS_ACC1(-2,0,0))
  fdiffc = functn(OPS_ACC1(3,0,0)) - functn(OPS_ACC1(-3,0,0))
  fdiffd = functn(OPS_ACC1(4,0,0)) - functn(OPS_ACC1(-4,0,0))
  fdiffe = functn(OPS_ACC1(5,0,0)) - functn(OPS_ACC1(-5,0,0))

  fderiv(OPS_ACC2(0,0,0)) = acoffx*fdiffa + bcoffx*fdiffb &
                          + ccoffx*fdiffc + dcoffx*fdiffd &
                          + ecoffx*fdiffe
END SUBROUTINE dfbydx_kernel_interior
! OPS_PAR_LOOP call
rangexyz = (/ 1, nxsize, 1, nysize, 1, nzsize /)
call ops_par_loop(dfbydx_kernel_interior, "dfbydx_interior_scheme", senga_grid, 3, rangexyz, &
                  ops_arg_dat(d_functn, 1, s3d_p500_to_m500_x, "real(8)", OPS_READ), &
                  ops_arg_dat(d_fderiv, 1, s3d_000, "real(8)", OPS_WRITE))
! ops_arg_dat(dataset, dim, stencil, data_type, access)
The code above shows how this loop transformation is done. Both the call and the elemental kernel are then parsed by the OPS code translation tool, and a relevant architecture-specific template is generated for each loop.
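For completeness, the stencil handles referenced in the call (s3d_p500_to_m500_x and s3d_000) are declared once during setup. The following is a minimal sketch of how that might look, assuming the Fortran binding follows the ops_decl_stencil(dims, points, offsets, handle, name) pattern used in the OPS example applications; the offset ordering and string names here are illustrative:

! Illustrative sketch of the stencil declarations (done once at setup);
! offsets are (x,y,z) triples, ordering assumed.
type(ops_stencil) :: s3d_000, s3d_p500_to_m500_x
integer :: pt_000(3), pt_x(3*11), i

pt_000 = (/ 0, 0, 0 /)                  ! single-point (write) stencil
do i = 1, 11
  pt_x(3*(i-1)+1) = i - 6               ! x offsets run from -5 to +5
  pt_x(3*(i-1)+2) = 0
  pt_x(3*(i-1)+3) = 0
end do

call ops_decl_stencil(3, 1,  pt_000, s3d_000,            "000")
call ops_decl_stencil(3, 11, pt_x,   s3d_p500_to_m500_x, "p500_to_m500_x")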
Following is a sample template generated for the CPU version:
MODULE DFBYDX_KERNEL_INTERIOR_MODULE
USE OPS_FORTRAN_DECLARATIONS
USE OPS_FORTRAN_RT_SUPPORT
USE OPS_CONSTANTS
USE ISO_C_BINDING
IMPLICIT NONE
contains
!user function
!DEC$ ATTRIBUTES FORCEINLINE :: dfbydx_kernel_interior
SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
! function body from elemental kernel
END SUBROUTINE
SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
dat1_base, dat2_base, start, end )
…
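! Iterate over the local iteration range computed from start/end; the flat
! base-plus-offset expressions below index the padded 1D data arrays.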
DO n_z = 1, end(3)-start(3)+1
  DO n_y = 1, end(2)-start(2)+1
!DIR$ SIMD
    DO n_x = 1, end(1)-start(1)+1
      call dfbydx_kernel_interior( &
        opsDat1Local(dat1_base + (n_x-1)*1 + (n_y-1)*xdim1*1 + &
                     (n_z-1)*ydim1*xdim1*1), &
        opsDat2Local(dat2_base + (n_x-1)*1 + (n_y-1)*xdim2*1 + &
                     (n_z-1)*ydim2*xdim2*1) )
    END DO
  END DO
END DO
END SUBROUTINE
!host subroutine
SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, range, &
opsArg1, opsArg2)
…
opsArgArray(1) = opsArg1
opsArgArray(2) = opsArg2
…
call ops_halo_exchanges(opsArgArray,2,range)
…
call dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
dat1_base, dat2_base, start, end )
…
call ops_set_dirtybit_host(opsArgArray, 2)
call ops_set_halo_dirtybit3(opsArg2,range)
…
!Timing and data movement
…
END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE
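For reference, the base-plus-offset expression in the wrapper linearizes the 3D loop index in an x-major layout over the padded dimensions xdim1, ydim1 (the GPU template below uses the same layout). Assuming the OPS_ACC1 macro expands to the matching flat offset, the correspondence is:

offset = (n_x-1)*1 + (n_y-1)*xdim1*1 + (n_z-1)*ydim1*xdim1*1
! opsDat1Local(dat1_base + offset) is the centre point functn(ic,jc,kc);
! OPS_ACC1(1,0,0) shifts it by +1 in x, OPS_ACC1(0,1,0) by +xdim1,
! and OPS_ACC1(0,0,1) by +xdim1*ydim1.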
Following is a sample template generated for the GPU version:
MODULE DFBYDX_KERNEL_INTERIOR_MODULE
USE OPS_FORTRAN_DECLARATIONS
USE OPS_FORTRAN_RT_SUPPORT
USE OPS_CONSTANTS
USE ISO_C_BINDING
USE CUDAFOR
IMPLICIT NONE
contains
! user function
attributes (device) SUBROUTINE dfbydx_kernel_interior_gpu(functn, fderiv)
! function body from elemental kernel
END SUBROUTINE
! CUDA kernel function -- wrapper calling user kernel
attributes (global) SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
dat1_base, dat2_base, &
size1, size2, size3 )
…
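! Map each CUDA thread to one grid point; CUDA Fortran thread and block
! indices are 1-based, hence the (blockIdx%...-1) terms below.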
n_z = blockDim%z * (blockIdx%z-1) + threadIdx%z
n_y = blockDim%y * (blockIdx%y-1) + threadIdx%y
n_x = blockDim%x * (blockIdx%x-1) + threadIdx%x
…
arg1 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim1_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
xdim1_dfbydx_kernel_interior * ydim1_dfbydx_kernel_interior
arg2 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim2_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
xdim2_dfbydx_kernel_interior * ydim2_dfbydx_kernel_interior
…
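! Bounds guard: the launch grid is rounded up, so threads beyond the
! iteration range must not touch the data.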
IF ((n_x-1) < size1 .AND. (n_y-1) < size2 .AND. (n_z-1) < size3) THEN
call dfbydx_kernel_interior_gpu( opsDat1Local(dat1_base+arg1), &
opsDat2Local(dat2_base+arg2) )
END IF
…
END SUBROUTINE
!host subroutine
attributes (host) SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, &
range, opsArg1, opsArg2)
…
opsArgArray(1) = opsArg1
opsArgArray(2) = opsArg2
…
x_size = MAX(0,end(1)-start(1)+1)
y_size = MAX(0,end(2)-start(2)+1)
z_size = MAX(0,end(3)-start(3)+1)
…
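! Launch configuration: each thread block covers a block_size_x by
! block_size_y tile of a single z-plane.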
grid = dim3( (x_size-1)/getOPS_block_size_x() + 1, (y_size-1)/getOPS_block_size_y() + 1, &
             z_size )
tblock = dim3(getOPS_block_size_x(),getOPS_block_size_y(),1)
…
!halo exchanges
call ops_halo_exchanges(opsArgArray,2,range)
! call kernel wrapper function, passing in data pointers
call dfbydx_kernel_interior_wrap <<<grid, tblock>>> (opsDat1Local, opsDat2Local, &
dat1_base, dat2_base, &
x_size, y_size, z_size )
…
istat = cudaDeviceSynchronize()
…
call ops_set_dirtybit_device(opsArgArray, 2)
call ops_set_halo_dirtybit3(opsArg2,range)
…
!Timing and data movement
…
END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE
Initially, when we compared the results of the original SENGA+ with the generated CPU version, we noticed differences only from the 12th or 13th decimal place onwards, which was well within our expectations since we are using double precision. However, when comparing these results with the CUDA-accelerated computations, discrepancies surfaced as early as the 2nd or 3rd decimal place.
While exploring this issue, we discovered that the IMPLICIT NONE statement inside the elemental kernel was a contributing factor. After removing it, the CPU version matched the original results exactly, which was a significant step forward.
However, we are still encountering discrepancies with the CUDA implementation: differences persist from the 7th or 8th decimal place onwards. This falls short of our precision requirements; our objective is a match to at least the first 10 decimal places for the CUDA-accelerated computations.
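For concreteness, the criterion we check can be sketched as below; the array names f_ref and f_gpu are placeholders for the original and CUDA results of the same field, and a 10-decimal-place match is expressed here as an absolute difference below 1.0d-10, which is appropriate for values of order unity:

! Illustrative comparison sketch -- f_ref/f_gpu are placeholder arrays.
REAL(kind=8) :: diff, max_diff
max_diff = 0.0d0
DO kc = 1, nzsize
  DO jc = 1, nysize
    DO ic = 1, nxsize
      diff = ABS(f_ref(ic,jc,kc) - f_gpu(ic,jc,kc))
      max_diff = MAX(max_diff, diff)
    END DO
  END DO
END DO
IF (max_diff > 1.0d-10) PRINT *, 'precision target missed, max diff = ', max_diff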
Following is the information about the compiler we are using:
nvhpc/23.1
pgfortran (aka nvfortran) 23.1-0 64-bit target on x86-64 Linux -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES
Compiler flags (Tesla P100 GPU):
-O3 -fast -gopt -Mdclchk -Kieee -nofma -Mnofpapprox -mcmodel=medium -Ktrap=unf -Mnofprelaxed -m64 -pc 64 -Mcuda=cc60
Please provide suggestions on how we can achieve this precision.
Do let me know if any more information is needed.