Ok, since nblocks will be small, we'll want to take a different strategy.
CUDA, which our OpenACC implementation targets when using NVIDIA GPUs, organizes parallelism into blocks and threads, where threads are grouped into blocks. Blocks are scheduled on the device's individual multiprocessors (SMs), with a maximum of 16 or 32 blocks resident per SM depending on the device. So if you're using a V100 with 80 SMs, this means you can run approximately 1280 concurrent blocks. (There are other factors involved; do a web search on CUDA occupancy for more details, but for now let's keep things simple.)
For OpenACC, we map a "gang" loop to the CUDA blocks and a "vector" loop to the CUDA threads. So in this case:
!$acc parallel loop collapse(2)   << implicitly using "gang"
DO nbl = 1,nblocks
  DO n_prim = 1,nprims            << 1 nblock * 5 nprims means only 5 of 1280 possible blocks can be used
    !$acc loop collapse(3)        << implicit "vector" loop of 128 threads
    DO k = 1, NK(nbl)             << 1
      DO j = 1, NJ(nbl)           << 300
        DO i = 1, NI(nbl)         << 300; each thread will do ~700 iterations (300*300 / 128)
So basically, with only 5 gangs you're severely under-utilizing the GPU. Even with nblocks = 10, that's only 50 blocks, which is better but still pretty small.
If you didn't have the bounds look-up arrays, I'd say the strategy would be to collapse the outer 4 loops, giving you 1500 blocks, and use an inner vector loop of 300 (you want the vector loop to match the stride-1 dimension of your arrays, which "i" does). But with the look-up arrays causing the dependency, going back to your first thought of not parallelizing the "nbl" loop would be the way to go. Though, I'd add the "async" clause so that you can launch multiple, potentially overlapping kernels (note: put an "!$acc wait" just after the call to "Compute_Conservs" and before "CPU_TIME").
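Just to illustrate that first option, here's a minimal sketch of the collapse(4) version. It only works if the per-block bounds can be replaced by plain scalars (I'm calling them NKs, NJs, NIs, purely hypothetical names), which is exactly what your look-up arrays prevent:

! Hypothetical sketch: assumes every block has the same extents, so the
! bounds NKs, NJs, NIs are scalars rather than the NK/NJ/NI look-up arrays.
!$acc parallel loop gang collapse(4)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NKs
      DO j = 1, NJs
        !$acc loop vector
        DO i = 1, NIs    ! "i" is the stride-1 dimension, so it gets the vector loop
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

Since you do have the look-up arrays, the async version would look something like this: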
integer queue

DO nbl = 1,nblocks
  queue = nbl
  !$acc parallel loop gang collapse(3) async(queue)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
DO nbl = 1,nblocks
  queue = nblocks+nbl
  !$acc parallel loop collapse(3) async(queue)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)
          Py(i,j,k,nbl,n_prim) = j*k + Cy(i,j,k,nbl,1)*Cy(i,j,k,nbl,5) + Cy(i,j,k,nbl,3)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
DO nbl = 1,nblocks
  queue = (2*nblocks)+nbl
  !$acc parallel loop collapse(3) async(queue)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)
          Pz(i,j,k,nbl,n_prim) = k*i + Cz(i,j,k,nbl,1)*Cz(i,j,k,nbl,5) + Cz(i,j,k,nbl,4)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
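Also, since all of these kernels are launched with "async", remember the wait from the note above so nothing reads Px/Py/Pz (or gets timed) before the queues finish. The placement would be roughly:

! ... existing call to Compute_Conservs here ...
!$acc wait    ! block until every outstanding async queue has completed
! ... existing CPU_TIME call here ...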
Now, these kernels are really small with very little in them, so the performance characteristics may change when moving to your full application. But hopefully this gives you some ideas.
Though another thought: if you can use the "max" variables for the loop bounds and put an if check inside to ensure the indices stay within each block's bounds, that would allow you to parallelize all five loops, or the outer 4 with an inner vector loop. Not sure it will work for the full app, but maybe?
!$acc parallel loop gang vector collapse(5) async
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NKmax
      DO j = 1, NJmax
        DO i = 1, NImax
          if (k.le.NK(nbl) .and. j.le.NJ(nbl) .and. i.le.NI(nbl)) then
            Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
          endif
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc parallel loop gang collapse(4) async
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NKmax
      DO j = 1, NJmax
        !$acc loop vector
        DO i = 1, NImax
          if (k.le.NK(nbl) .and. j.le.NJ(nbl) .and. i.le.NI(nbl)) then
            Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
          endif
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
-Mat