# B-splines on GPUs: OpenACC - Fortran

Hello everyone,

I have the following problem. I have a main subroutine, let us call it `main_function` (for 3D BSplines). It takes as input several tensors.

This function contains only IF-conditions. If a condition is satisfied, other functions are called. Let us call these functions: `function_a`, `function_b`, and `function_c` which are parallelizable.

The structure is as follows

``````subroutine main_function(paras)
if(1) then
call function_a
else if (2)
call function_b
else if (3)
call function_c
end if
end subroutine main_function
``````

with

``````subroutine function_a(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_a

subroutine function_b(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_b

subroutine function_c(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_c
``````

The subroutines `function_a`, `function_b`, and `function_c` have a B-spline tensor (`eta`) as an output calculated on GPU. I donât want to move this tensor to the host since it is not needed there. However, after calculating `eta` on GPU using `main_function`, an interpolation subroutine `interpolate3D` is called to interpolate the function. The definition of `interpolate3D` is something like

``````subroutine interpolate3D(eta, x, y, z, fAtxyz)

!\$acc routine seq

interpolate ...
end subroutine interpolate3D
``````

To summarize the the pseudo-code is something like

``````call main_function(paras)

!\$acc parallel loop present(x, y, eta, fAtxyz)
do i = 1, N
call interpolate3D(eta, x(i), y(i), z(i), fAtxyz(i))
end do
``````

My problems and questions are:

1)- When I donât use â`!\$acc update self (eta)`â before the loop, the results are completely wrong. Does this mean that â`present clause`â doesnât find correctly `eta`, calculated by `main_function`, on GPU. Therefore, one needs to update the host, and then recopy it back to the GPU?

2)- How to ensure that `interpolate3D` is working on GPU? For example, if I donât have the above loop, does only adding â`!\$acc routine seq`â ensure that it works on GPU and searches for different quantities there?

3)- In fact, when there is no loop, adding â`!\$acc update self (eta)`â is required to have correct results. Does this mean that in this case the subroutine is executed on CPU?

3)- To summarize, If I have two subroutines: the first choses between different subroutines based on if-conditions to calculate a vector or tensor and keep it on GPU (I donât want to update the host), while the second will use this vector to perform some calculations on GPU, how to do this correctly with `openACC`?

I attached a very simple example concerning the questions above. The subroutine `calc_etaVec` calculates `eta` on GPU, while the subroutine `calcFunAtx` interpolates at the position `xp` using `eta` (Nearest-neighbor interpolation). I would like to know if possible how to allow `calcFunAtx` to work directly with GPU data? Moreover, comments, correction or/and advice concerning the implementations are very welcome
program.f90 (1.9 KB)

Sorry for being long and thank you very much for your help,

the folks visiting the âHPC compilersâ forum might be more familiar with Fortran related CUDA questions

1 Like

thank you very much for your reply

Hi feynman.physik,

The error I see is that youâre missing a âparallel loopâ around the loop that calls âcalFunAtxâ. How you have it now, this is being run on the host.

The routine directive tells the compiler to create a device callable version of the subroutine which can then be called from within compute region. However unless the ânohostâ clause is added, a host version of the subroutine will get created as well. This is why it still links and runs.

Hereâs the fixed version:

``````% cat program.f90
MODULE funs
use paras_mod
implicit none
CONTAINS

subroutine calc_etaVec(x, nx, eta)
integer, intent(in) :: nx
real(dp), dimension(:), intent(in) :: x
real(dp), dimension(:), intent(out) :: eta

integer :: i
!\$acc parallel loop present(x, eta)
do i = 1, nx
eta(i) = sin(x(i))
end do
end subroutine

subroutine calcFunAtx(xp, dx, eta, fAtx)
real(dp), intent(in) :: xp, dx
real(dp), dimension(:), intent(in) :: eta
real(dp), intent(out) :: fAtx
integer :: idx
! small fix, original version had the routine seq
! after the idx line below.  Needs to be in the declaration section
!\$acc routine seq  !  nohost << add if you don't want a host version

idx = 1 + floor(xp / dx)

fAtx = eta(idx)
end subroutine calcFunAtx

END MODULE

module paras_mod
implicit none
save
INTEGER, PARAMETER :: dp = selected_real_kind(14,300)
REAL(dp), PARAMETER :: PI=4.0d0*atan(1.0d0)
end module paras_mod

program approximateFun
use funs
use paras_mod

integer :: Nx
real(dp) :: dx
real(dp), dimension(:), allocatable :: x, eta
real(dp), dimension(5) :: xp, fAtxp

!\$acc declare create(Nx)
!\$acc declare create(x)
!\$acc declare create(eta)
!\$acc declare create(dx)
!\$acc declare create(fAtxp)
!\$acc declare create(xp)

Nx = 16
!\$acc update device(Nx)

xp = (/3.9, 4.1, 4.5, 5.0, 5.6/)
!\$acc update device(xp)

allocate(x(1 : Nx))
allocate(eta(1 : Nx))
eta = 0.0d0

dx = 2 * pi / (Nx - 1)
!\$acc update device(dx)

! x can be initialized on the GPU instead of updating it.
!\$acc parallel loop
do i = 1, Nx
x(i) = (i - 1.0d0) * dx
end do

call calc_etaVec(x, Nx, eta)

! FIX: ADD PARALLEL LOOP HERE
!\$acc parallel loop present(xp,dx,eta,fAtxp)
do i = 1, 5
call calcFunAtx(xp(i), dx, eta, fAtxp(i))
end do

!\$acc update self (fAtxp)

do i = 1, 5
write(6, *) 'xp, fAtxp', xp(i), fAtxp(i)
end do

deallocate(x)
deallocate(eta)

end program approximateFun
% nvfortran -Minfo=accel program.f90 -acc ; a.out
calc_etavec:
12, Generating present(x(:),eta(:))
Generating NVIDIA GPU code
13, !\$acc loop gang, vector(128) ! blockidx%x threadidx%x
calcfunatx:
18, Generating acc routine seq
Generating NVIDIA GPU code
approximatefun:
50, Generating create(nx) [if not already present]
53, Generating create(dx) [if not already present]
54, Generating create(fatxp(:)) [if not already present]
55, Generating create(xp(:)) [if not already present]
58, Generating update device(nx)
61, Generating update device(xp(:))
68, Generating update device(dx)
71, Generating NVIDIA GPU code
72, !\$acc loop gang, vector(128) ! blockidx%x threadidx%x
79, Generating present(xp(:),fatxp(:),dx)
Generating NVIDIA GPU code
80, !\$acc loop gang ! blockidx%x
84, Generating update self(fatxp(:))
xp, fAtxp    3.900000095367432       -0.5877852522924730
xp, fAtxp    4.099999904632568       -0.5877852522924730
xp, fAtxp    4.500000000000000       -0.8660254037844384
xp, fAtxp    5.000000000000000       -0.9945218953682733
xp, fAtxp    5.599999904632568       -0.7431448254773946
% nvfortran -Minfo=accel program.f90 ; a.out
xp, fAtxp    3.900000095367432       -0.5877852522924730
xp, fAtxp    4.099999904632568       -0.5877852522924730
xp, fAtxp    4.500000000000000       -0.8660254037844384
xp, fAtxp    5.000000000000000       -0.9945218953682732
xp, fAtxp    5.599999904632568       -0.7431448254773946
``````

Hope this helps,
Mat

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.