Ok, since nblocks will be small, we'll want to take a different strategy.
CUDA, which our OpenACC implementation targets when using NVIDIA GPUs, organizes parallelism into blocks and threads, where the threads are grouped into a block. Blocks run on the individual multiprocessors (SMs) on the device, with a maximum of 16 or 32 blocks executing per SM depending on the device. So if you're using a V100 with 80 SMs, this means you can run approximately 1280 concurrent blocks (80 SMs x 16 blocks per SM). (There are other factors involved; do a web search on CUDA occupancy for more details, but for now let's keep things simple.)
For OpenACC, we map a "gang" loop to the CUDA blocks and a "vector" loop to the CUDA threads. So in this case:
!$acc parallel loop collapse(2)        ! implicitly "gang"
DO nbl = 1,nblocks
   DO n_prim = 1,nprims                ! 1 nblock * 5 nprims => only 5 of 1280 possible blocks used
!$acc loop collapse(3)                 ! implicit "vector" loop of 128 threads
      DO k = 1, NK(nbl)                ! 1
         DO j = 1, NJ(nbl)             ! 300
            DO i = 1, NI(nbl)          ! 300; each thread will do ~700 iterations (300*300 / 128)
So basically, with only 5 gangs, you're severely under-utilizing the GPU. Even with nblocks = 10, that's only 50 blocks, which is better, but still pretty small.
If you didn't have the bounds look-up arrays, I'd say the strategy would be to collapse the outer 4 loops, giving you 1500 blocks, and use an inner vector loop of 300 (you want the vector loop to match the stride-1 dimension of your arrays, which "i" does); a sketch of that is below. But with the look-up arrays causing the dependency, going back to your first thought of not parallelizing the "nbl" loop would be the way to go. Though I'd add the "async" clause so that you can launch multiple, potentially overlapping kernels. (Note: put an "!$acc wait" just after the call to "Compute_Conservs" and before "CPU_TIME".)
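Just to illustrate that alternative, here's a minimal sketch. It assumes every block shares the same extents (NIfix/NJfix/NKfix are made-up scalar names), which is exactly what your look-up arrays rule out, so treat it as hypothetical:

! Hypothetical: only works if the bounds are fixed scalars (NIfix/NJfix/NKfix),
! not per-block look-up arrays as in your case.
!$acc parallel loop gang collapse(4)
DO nbl = 1,nblocks
   DO n_prim = 1,nprims
      DO k = 1, NKfix
         DO j = 1, NJfix
!$acc loop vector
            DO i = 1, NIfix            ! "i" is the stride-1 dimension, so it gets the vector loop
               Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

But again, given the per-block bounds, here's the async version: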
integer queue

! Px sweep: one kernel per block, each on its own async queue so the kernels can overlap
DO nbl = 1,nblocks
   queue = nbl
!$acc parallel loop gang collapse(3) async(queue)
   DO n_prim = 1,nprims
      DO k = 1, NK(nbl)
         DO j = 1, NJ(nbl)
!$acc loop vector
            DO i = 1, NI(nbl)
               Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

! Py sweep: offset the queue numbers by nblocks so they don't collide with the Px kernels
DO nbl = 1,nblocks
   queue = nblocks+nbl
!$acc parallel loop collapse(3) async(queue)
   DO n_prim = 1,nprims
      DO k = 1, NK(nbl)
         DO j = 1, NJ(nbl)
!$acc loop vector
            DO i = 1, NI(nbl)
               Py(i,j,k,nbl,n_prim) = j*k + Cy(i,j,k,nbl,1)*Cy(i,j,k,nbl,5) + Cy(i,j,k,nbl,3)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

! Pz sweep: again offset the queue numbers
DO nbl = 1,nblocks
   queue = (2*nblocks)+nbl
!$acc parallel loop collapse(3) async(queue)
   DO n_prim = 1,nprims
      DO k = 1, NK(nbl)
         DO j = 1, NJ(nbl)
!$acc loop vector
            DO i = 1, NI(nbl)
               Pz(i,j,k,nbl,n_prim) = k*i + Cz(i,j,k,nbl,1)*Cz(i,j,k,nbl,5) + Cz(i,j,k,nbl,4)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO
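And for reference, the synchronization I mentioned would sit something like this (a minimal sketch; the argument list of Compute_Conservs is omitted since I don't have that part of your code, and "t1" is just a placeholder timer variable):

CALL Compute_Conservs(...)    ! your existing call (arguments omitted here)
!$acc wait                    ! block until all the async queues above have completed
CALL CPU_TIME(t1)             ! the timer now reflects finished GPU work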
Now, these kernels are really small with very little in them, so the performance characteristics may change when moving to your full application. But hopefully this gives you some ideas.
Another thought: if you can use the "max" variables for the loop bounds, and then put an if check inside to ensure the indices are within each block's bounds, that would allow you to parallelize all five loops, or the outer four plus an inner vector loop. Not sure it will work for the full app, but maybe?
!$acc parallel loop gang vector collapse(5) async
DO nbl = 1,nblocks
   DO n_prim = 1,nprims
      DO k = 1, NKmax
         DO j = 1, NJmax
            DO i = 1, NImax
               ! skip the padded iterations that fall outside this block's extents
               if (k .le. NK(nbl) .and. j .le. NJ(nbl) .and. i .le. NI(nbl)) then
                  Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
               endif
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

!$acc parallel loop gang collapse(4) async
DO nbl = 1,nblocks
   DO n_prim = 1,nprims
      DO k = 1, NKmax
         DO j = 1, NJmax
!$acc loop vector
            DO i = 1, NImax
               if (k .le. NK(nbl) .and. j .le. NJ(nbl) .and. i .le. NI(nbl)) then
                  Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
               endif
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO
-Mat