So I have a subroutine that goes something like this:
do tmp_index = 1, mesh%Ninterior_faces/chunksize
   ! Gather left/right flux values into contiguous chunk buffers.
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1
            tmpi_left(p_indx, var, c_indx) = &
               data%uflux(left_indices_1 + (p_indx - 1), var, left)
            tmpi_rght(p_indx, var, c_indx) = &
               data%uflux(rght_indices_1 + (p_indx - 1), var, rght)
         end do
      end do
   end do
   ! Permute the gathered values using the face-point index maps.
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1
            sfi_left = mesh%sfi(p_indx, left_face_index, left)
            sfi_rght = mesh%sfi(p_indx, rght_face_index, rght)
            uflux_left(p_indx, var, c_indx) = tmpi_left(sfi_left, var, c_indx)
            uflux_rght(p_indx, var, c_indx) = tmpi_rght(sfi_rght, var, c_indx)
         end do
      end do
   end do
   call acc_evaluate_interaction_flux(uflux_left, uflux_rght, Fi, Fv)
   ! Scatter the interaction flux back to the global array.
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1
            ! Recompute the map here; otherwise sfi_left would hold a stale
            ! value left over from the last iteration of the previous loop nest.
            sfi_left = mesh%sfi(p_indx, left_face_index, left)
            data%Fi(left_indices_1 + (sfi_left - 1), var, left) = &
               Fi(p_indx, var, c_indx) + (Fv(p_indx, var, c_indx) + &
               ibeta_viscous*nx*Fv_dot_n)
         end do
      end do
   end do
end do
There are multiple subroutines in the code, and I have converted two of them so far. The other converted subroutine runs much faster on the GPU than on the CPU, which is what prompted me to port this one as well.
Now, the first problem: this particular subroutine runs much slower when compiled with PGI than with gfortran, as much as 5 times slower.
The net result is that this subroutine becomes the costliest subroutine in the PGI binary, whereas it is not in the gfortran binary. Therefore, even after putting the other subroutine on the GPU, the code as a whole remains much slower because of this one.
The other problem is that, while some tuning does make this subroutine run faster on the GPU, it is faster only relative to the PGI CPU runtime; it is still much slower than the gfortran runtime.