# B-splines on GPUs: OpenACC - Fortran

Hello everyone,

I have the following problem. I have a main subroutine, let us call it `main_function` (for 3D BSplines). It takes as input several tensors.

This function contains only IF-conditions. If a condition is satisfied, other functions are called. Let us call these functions: `function_a`, `function_b`, and `function_c` which are parallelizable.

The structure is as follows

``````subroutine main_function(paras)
if(1) then
call function_a
else if (2)
call function_b
else if (3)
call function_c
end if
end subroutine main_function
``````

with

``````subroutine function_a(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_a

subroutine function_b(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_b

subroutine function_c(paras)
!\$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_c
``````

The subroutines `function_a`, `function_b`, and `function_c` have a B-spline tensor (`eta`) as an output calculated on GPU. I don’t want to move this tensor to the host since it is not needed there. However, after calculating `eta` on GPU using `main_function`, an interpolation subroutine `interpolate3D` is called to interpolate the function. The definition of `interpolate3D` is something like

``````subroutine interpolate3D(eta, x, y, z, fAtxyz)

!\$acc routine seq

interpolate ...
end subroutine interpolate3D
``````

To summarize the the pseudo-code is something like

``````call main_function(paras)

!\$acc parallel loop present(x, y, eta, fAtxyz)
do i = 1, N
call interpolate3D(eta, x(i), y(i), z(i), fAtxyz(i))
end do
``````

My problems and questions are:

1)- When I don’t use ‘`!\$acc update self (eta)`’ before the loop, the results are completely wrong. Does this mean that ‘`present clause`’ doesn’t find correctly `eta`, calculated by `main_function`, on GPU. Therefore, one needs to update the host, and then recopy it back to the GPU?

2)- How to ensure that `interpolate3D` is working on GPU? For example, if I don’t have the above loop, does only adding ‘`!\$acc routine seq`’ ensure that it works on GPU and searches for different quantities there?

3)- In fact, when there is no loop, adding ‘`!\$acc update self (eta)`’ is required to have correct results. Does this mean that in this case the subroutine is executed on CPU?

3)- To summarize, If I have two subroutines: the first choses between different subroutines based on if-conditions to calculate a vector or tensor and keep it on GPU (I don’t want to update the host), while the second will use this vector to perform some calculations on GPU, how to do this correctly with `openACC`?

I attached a very simple example concerning the questions above. The subroutine `calc_etaVec` calculates `eta` on GPU, while the subroutine `calcFunAtx` interpolates at the position `xp` using `eta` (Nearest-neighbor interpolation). I would like to know if possible how to allow `calcFunAtx` to work directly with GPU data? Moreover, comments, correction or/and advice concerning the implementations are very welcome
program.f90 (1.9 KB)

Sorry for being long and thank you very much for your help,

the folks visiting the “HPC compilers” forum might be more familiar with Fortran related CUDA questions

1 Like

Hi feynman.physik,

The error I see is that you’re missing a “parallel loop” around the loop that calls “calFunAtx”. How you have it now, this is being run on the host.

The routine directive tells the compiler to create a device callable version of the subroutine which can then be called from within compute region. However unless the “nohost” clause is added, a host version of the subroutine will get created as well. This is why it still links and runs.

Here’s the fixed version:

``````% cat program.f90
MODULE funs
use paras_mod
implicit none
CONTAINS

subroutine calc_etaVec(x, nx, eta)
integer, intent(in) :: nx
real(dp), dimension(:), intent(in) :: x
real(dp), dimension(:), intent(out) :: eta

integer :: i
!\$acc parallel loop present(x, eta)
do i = 1, nx
eta(i) = sin(x(i))
end do
end subroutine

subroutine calcFunAtx(xp, dx, eta, fAtx)
real(dp), intent(in) :: xp, dx
real(dp), dimension(:), intent(in) :: eta
real(dp), intent(out) :: fAtx
integer :: idx
! small fix, original version had the routine seq
! after the idx line below.  Needs to be in the declaration section
!\$acc routine seq  !  nohost << add if you don't want a host version

idx = 1 + floor(xp / dx)

fAtx = eta(idx)
end subroutine calcFunAtx

END MODULE

module paras_mod
implicit none
save
INTEGER, PARAMETER :: dp = selected_real_kind(14,300)
REAL(dp), PARAMETER :: PI=4.0d0*atan(1.0d0)
end module paras_mod

program approximateFun
use funs
use paras_mod

integer :: Nx
real(dp) :: dx
real(dp), dimension(:), allocatable :: x, eta
real(dp), dimension(5) :: xp, fAtxp

!\$acc declare create(Nx)
!\$acc declare create(x)
!\$acc declare create(eta)
!\$acc declare create(dx)
!\$acc declare create(fAtxp)
!\$acc declare create(xp)

Nx = 16
!\$acc update device(Nx)

xp = (/3.9, 4.1, 4.5, 5.0, 5.6/)
!\$acc update device(xp)

allocate(x(1 : Nx))
allocate(eta(1 : Nx))
eta = 0.0d0

dx = 2 * pi / (Nx - 1)
!\$acc update device(dx)

! x can be initialized on the GPU instead of updating it.
!\$acc parallel loop
do i = 1, Nx
x(i) = (i - 1.0d0) * dx
end do

call calc_etaVec(x, Nx, eta)

! FIX: ADD PARALLEL LOOP HERE
!\$acc parallel loop present(xp,dx,eta,fAtxp)
do i = 1, 5
call calcFunAtx(xp(i), dx, eta, fAtxp(i))
end do

!\$acc update self (fAtxp)

do i = 1, 5
write(6, *) 'xp, fAtxp', xp(i), fAtxp(i)
end do

deallocate(x)
deallocate(eta)

end program approximateFun
% nvfortran -Minfo=accel program.f90 -acc ; a.out
calc_etavec:
12, Generating present(x(:),eta(:))
Generating NVIDIA GPU code
13, !\$acc loop gang, vector(128) ! blockidx%x threadidx%x
calcfunatx:
18, Generating acc routine seq
Generating NVIDIA GPU code
approximatefun:
50, Generating create(nx) [if not already present]
53, Generating create(dx) [if not already present]
54, Generating create(fatxp(:)) [if not already present]
55, Generating create(xp(:)) [if not already present]
58, Generating update device(nx)
61, Generating update device(xp(:))
68, Generating update device(dx)
71, Generating NVIDIA GPU code
72, !\$acc loop gang, vector(128) ! blockidx%x threadidx%x
79, Generating present(xp(:),fatxp(:),dx)
Generating NVIDIA GPU code
80, !\$acc loop gang ! blockidx%x
84, Generating update self(fatxp(:))
xp, fAtxp    3.900000095367432       -0.5877852522924730
xp, fAtxp    4.099999904632568       -0.5877852522924730
xp, fAtxp    4.500000000000000       -0.8660254037844384
xp, fAtxp    5.000000000000000       -0.9945218953682733
xp, fAtxp    5.599999904632568       -0.7431448254773946
% nvfortran -Minfo=accel program.f90 ; a.out
xp, fAtxp    3.900000095367432       -0.5877852522924730
xp, fAtxp    4.099999904632568       -0.5877852522924730
xp, fAtxp    4.500000000000000       -0.8660254037844384
xp, fAtxp    5.000000000000000       -0.9945218953682732
xp, fAtxp    5.599999904632568       -0.7431448254773946
``````

Hope this helps,
Mat

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.