B-splines on GPUs: OpenACC - Fortran

Hello everyone,

I have the following problem. I have a main subroutine, let us call it main_function (for 3D BSplines). It takes as input several tensors.

This function contains only IF-conditions. If a condition is satisfied, other functions are called. Let us call these functions: function_a, function_b, and function_c which are parallelizable.

The structure is as follows

subroutine main_function(paras)
if(1) then
call function_a
else if (2)
call function_b
else if (3)
call function_c
end if
end subroutine main_function

with

subroutine function_a(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_a

subroutine function_b(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_b

subroutine function_c(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_c

The subroutines function_a, function_b, and function_c have a B-spline tensor (eta) as an output calculated on GPU. I don’t want to move this tensor to the host since it is not needed there. However, after calculating eta on GPU using main_function, an interpolation subroutine interpolate3D is called to interpolate the function. The definition of interpolate3D is something like

subroutine interpolate3D(eta, x, y, z, fAtxyz)

!$acc routine seq 

interpolate ...
end subroutine interpolate3D

To summarize the the pseudo-code is something like

call main_function(paras) 

!$acc parallel loop present(x, y, eta, fAtxyz)
do i = 1, N
call interpolate3D(eta, x(i), y(i), z(i), fAtxyz(i))
end do 

My problems and questions are:

1)- When I don’t use ‘!$acc update self (eta)’ before the loop, the results are completely wrong. Does this mean that ‘present clause’ doesn’t find correctly eta, calculated by main_function, on GPU. Therefore, one needs to update the host, and then recopy it back to the GPU?

2)- How to ensure that interpolate3D is working on GPU? For example, if I don’t have the above loop, does only adding ‘!$acc routine seq’ ensure that it works on GPU and searches for different quantities there?

3)- In fact, when there is no loop, adding ‘!$acc update self (eta)’ is required to have correct results. Does this mean that in this case the subroutine is executed on CPU?

3)- To summarize, If I have two subroutines: the first choses between different subroutines based on if-conditions to calculate a vector or tensor and keep it on GPU (I don’t want to update the host), while the second will use this vector to perform some calculations on GPU, how to do this correctly with openACC?

I attached a very simple example concerning the questions above. The subroutine calc_etaVec calculates eta on GPU, while the subroutine calcFunAtx interpolates at the position xp using eta (Nearest-neighbor interpolation). I would like to know if possible how to allow calcFunAtx to work directly with GPU data? Moreover, comments, correction or/and advice concerning the implementations are very welcome
program.f90 (1.9 KB)

Sorry for being long and thank you very much for your help,

the folks visiting the “HPC compilers” forum might be more familiar with Fortran related CUDA questions

1 Like

thank you very much for your reply

Hi feynman.physik,

The error I see is that you’re missing a “parallel loop” around the loop that calls “calFunAtx”. How you have it now, this is being run on the host.

The routine directive tells the compiler to create a device callable version of the subroutine which can then be called from within compute region. However unless the “nohost” clause is added, a host version of the subroutine will get created as well. This is why it still links and runs.

Here’s the fixed version:

% cat program.f90
MODULE funs
    use paras_mod
    implicit none
    CONTAINS

    subroutine calc_etaVec(x, nx, eta)
        integer, intent(in) :: nx
        real(dp), dimension(:), intent(in) :: x
        real(dp), dimension(:), intent(out) :: eta

        integer :: i
        !$acc parallel loop present(x, eta)
        do i = 1, nx
            eta(i) = sin(x(i))
        end do
    end subroutine

    subroutine calcFunAtx(xp, dx, eta, fAtx)
        real(dp), intent(in) :: xp, dx
        real(dp), dimension(:), intent(in) :: eta
        real(dp), intent(out) :: fAtx
        integer :: idx
! small fix, original version had the routine seq
! after the idx line below.  Needs to be in the declaration section
        !$acc routine seq  !  nohost << add if you don't want a host version

        idx = 1 + floor(xp / dx)

        fAtx = eta(idx)
    end subroutine calcFunAtx

END MODULE

module paras_mod
    implicit none
    save
    INTEGER, PARAMETER :: dp = selected_real_kind(14,300)
    REAL(dp), PARAMETER :: PI=4.0d0*atan(1.0d0)
end module paras_mod

program approximateFun
use funs
use paras_mod

    integer :: Nx
    real(dp) :: dx
    real(dp), dimension(:), allocatable :: x, eta
    real(dp), dimension(5) :: xp, fAtxp

    !$acc declare create(Nx)
    !$acc declare create(x)
    !$acc declare create(eta)
    !$acc declare create(dx)
    !$acc declare create(fAtxp)
    !$acc declare create(xp)

    Nx = 16
    !$acc update device(Nx)

    xp = (/3.9, 4.1, 4.5, 5.0, 5.6/)
    !$acc update device(xp)

    allocate(x(1 : Nx))
    allocate(eta(1 : Nx))
    eta = 0.0d0

    dx = 2 * pi / (Nx - 1)
    !$acc update device(dx)

! x can be initialized on the GPU instead of updating it.
    !$acc parallel loop
    do i = 1, Nx
        x(i) = (i - 1.0d0) * dx
    end do

    call calc_etaVec(x, Nx, eta)

! FIX: ADD PARALLEL LOOP HERE
    !$acc parallel loop present(xp,dx,eta,fAtxp)
    do i = 1, 5
        call calcFunAtx(xp(i), dx, eta, fAtxp(i))
    end do

    !$acc update self (fAtxp)

    do i = 1, 5
        write(6, *) 'xp, fAtxp', xp(i), fAtxp(i)
    end do

    deallocate(x)
    deallocate(eta)

end program approximateFun
% nvfortran -Minfo=accel program.f90 -acc ; a.out
calc_etavec:
     12, Generating present(x(:),eta(:))
         Generating NVIDIA GPU code
         13, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
calcfunatx:
     18, Generating acc routine seq
         Generating NVIDIA GPU code
approximatefun:
     50, Generating create(nx) [if not already present]
     53, Generating create(dx) [if not already present]
     54, Generating create(fatxp(:)) [if not already present]
     55, Generating create(xp(:)) [if not already present]
     58, Generating update device(nx)
     61, Generating update device(xp(:))
     68, Generating update device(dx)
     71, Generating NVIDIA GPU code
         72, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     79, Generating present(xp(:),fatxp(:),dx)
         Generating NVIDIA GPU code
         80, !$acc loop gang ! blockidx%x
     84, Generating update self(fatxp(:))
 xp, fAtxp    3.900000095367432       -0.5877852522924730
 xp, fAtxp    4.099999904632568       -0.5877852522924730
 xp, fAtxp    4.500000000000000       -0.8660254037844384
 xp, fAtxp    5.000000000000000       -0.9945218953682733
 xp, fAtxp    5.599999904632568       -0.7431448254773946
% nvfortran -Minfo=accel program.f90 ; a.out
 xp, fAtxp    3.900000095367432       -0.5877852522924730
 xp, fAtxp    4.099999904632568       -0.5877852522924730
 xp, fAtxp    4.500000000000000       -0.8660254037844384
 xp, fAtxp    5.000000000000000       -0.9945218953682732
 xp, fAtxp    5.599999904632568       -0.7431448254773946

Hope this helps,
Mat