Hello,
I have a 3D CFD application (finite-volume scheme) written in Fortran, with two levels of parallelism already in place: MPI and OpenMP. I am trying to add an alternative second level with OpenACC directives on a Tesla K40 GPU, for comparison with OpenMP.
The heart of the application (the most CPU-time-consuming part, about 60%) is the computation of the fluxes in the three directions.
For each direction, we have two nested loops (over the two other directions), in which a vector is extracted and the fluxes are computed for all points of that vector; this construction therefore appears three times.
The computation of the fluxes is done by another subroutine, called inside these two nested loops. This subroutine uses temporary variables, scalars and vectors, that should be private to each call.
Schematically, for the first direction, we have:
DO k = 1, Nz
   DO j = 1, Ny
      !
      DO L = 1, 5              ! Five unknowns per mesh cell
         DO i = 1, Nx
            Q1(i,L) = Q(i,L,j,k)
         END DO
      END DO
      !
      CALL compute_flux (Nx, Q1, F)
      !
      DO L = 1, 5
         DO i = 1, Nx
            RQ(i,L,j,k) = linear function of ( F(i,L) )
         END DO
      END DO
      !
   END DO
END DO
In the subroutine compute_flux, we have temporary arrays used in several loops that can be vectorized by the compiler. Each array should be private to the thread that makes the call.
We have (N being the dimension received as an argument):
11 local arrays of dimension (N)
2 local arrays of dimension (N,5)
2 local arrays of dimension (N,6,5)
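To make this concrete, the declaration part of compute_flux looks roughly like the sketch below (the array names and the double-precision kind are only placeholders, not the real ones):
SUBROUTINE compute_flux (N, Q1, F)
   IMPLICIT NONE
   INTEGER, INTENT(IN)  :: N
   REAL(8), INTENT(IN)  :: Q1(N,5)
   REAL(8), INTENT(OUT) :: F(N,5)
   ! Local work arrays; each one should be private to the caller
   REAL(8) :: W1(N), W2(N), W3(N)     ! ... and so on, 11 arrays of dimension (N)
   REAL(8) :: QL(N,5), QR(N,5)        ! 2 arrays of dimension (N,5)
   REAL(8) :: AL(N,6,5), AR(N,6,5)    ! 2 arrays of dimension (N,6,5)
   ! ... several vectorizable loops follow (see below)
END SUBROUTINE compute_flux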
In the calling subroutine, I chose to parallelize the two nested loops:
DO k = 1, Nz
DO j = 1, Ny
In this case, I think I can use the following OpenACC directives:
!$ACC ROUTINE(compute_flux) VECTOR
!$ACC DATA PRESENT(Q,RQ,...)
!$ACC PARALLEL PRIVATE(Q1,F,...)
!$ACC LOOP COLLAPSE(2) PRIVATE(i,L)
DO k = 1, Nz
   DO j = 1, Ny
      !
      DO L = 1, 5
         DO i = 1, Nx
            Q1(i,L) = Q(i,L,j,k)
         END DO
      END DO
      !
      CALL compute_flux (Nx, Q1, F)
      !
      DO L = 1, 5
         DO i = 1, Nx
            RQ(i,L,j,k) = linear function of ( F(i,L) )
         END DO
      END DO
      !
   END DO
END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(j,L)
DO k = 1, Nz
   DO i = 1, Nx
   END DO
END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(k,L)
DO j = 1, Ny
   DO i = 1, Nx
   END DO
END DO
!$ACC END LOOP
!$ACC END PARALLEL
!$ACC END DATA
In the subroutine compute_flux, one has:
!$ACC ROUTINE VECTOR
and directives like
!$ACC LOOP VECTOR
for the different loops in it.
In compute_flux, there are several loops. Local arrays computed in one loop are reused in subsequent loops, so their values have to be preserved throughout the execution of the routine.
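For example, the pattern inside compute_flux is roughly the following (the right-hand sides are purely illustrative, not the real formulas):
!$ACC LOOP VECTOR
DO i = 1, N
   W1(i) = 0.5D0 * ( Q1(i,1) + Q1(i,2) )   ! W1 computed here (illustrative formula)
END DO
!
!$ACC LOOP VECTOR
DO i = 1, N
   F(i,1) = W1(i) * Q1(i,1)                ! W1 reused here, so its values must
END DO                                     ! persist between the two vector loops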
My problem is that I do not know how to make the local arrays of this subroutine private to the thread (gang) that calls compute_flux, while keeping the code efficient.
I have tried adding clauses to control the number of gangs and the number of workers.
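For example, something along these lines (the values are simply those I experimented with):
!$ACC PARALLEL NUM_GANGS(64) NUM_WORKERS(1) PRIVATE(Q1,F,...)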
I can go up to 64 gangs with 1 worker; beyond that I get an internal CUDA error. In the best case (64 gangs), the code runs twice as slowly as a single MPI process alone.
If I use more than one worker, I also get an internal CUDA error. I think it comes from the local arrays being shared.
I want to increase my mesh size from (100x150x100) to (1000x300x26) and beyond. With the larger mesh I cannot go further than 4 gangs / 1 worker (internal CUDA error), and the computing time is very poor compared to what I obtain with the 20 CPU cores of the node.
Could you please give me some advice?
Regards,
Guy.