OpenACC: Best way to parallelize nested DO loops with data dependency between loops?

I am trying to parallelize the following simple nested DO loop. Compared to a serial run on the CPU, this code gives me no speed-up when executed on a V100 GPU, even though there is good potential for parallelization in the code/algorithm. Is there a different and better way to parallelize this loop? (The full code is also uploaded as an attachment.)

Compiler: pgfortran
Flags: -acc -O3 -ta=tesla:managed -Minfo=accel

Input variables: nblocks = 1, nprims = 5, NI(1) = 300, NJ(1) = 300, NK(1) = 1

Also, the variable 'nbl' from the outermost loop appears in the bounds of the inner DO loops, so there is a dependency and the outer loop can't be included in the collapse clause. Is there anything that can be done to exploit the parallelism from the outer loop as well?

Thanks!

Code:

DO nbl = 1,nblocks
!$acc parallel loop collapse(4)
DO n_prim = 1,nprims
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)
	Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
ENDDO
ENDDO
ENDDO
ENDDO
!$acc end parallel loop
ENDDO

Code.zip (1.5 KB)

Is there anything that can be done to exploit the parallelism from the outer loop as well?

The dependency is with the array look-ups for the upper bounds of the loops. In order to collapse loops, the iteration count must be known before entering the loop nest, but here the counts are variable.

Try something like the following and split the parallelism into two levels:

!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
DO n_prim = 1,nprims
!$acc loop collapse(3)
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)
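
Filled in against the Px loop from your snippet (same body and bounds, just the two-level schedule), that would look like:

!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
DO n_prim = 1,nprims
!$acc loop collapse(3)
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)
	Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO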

Looking at performance, this change doesn't improve it much, but only because nblocks is 1. After fixing some bugs in your code ("n_prim" should be "n_coms", set nblocks before the allocation, initialize all of NK, NJ, and NI, not just the first element) and increasing nblocks to 128, my times are as follows on a Skylake + V100 system using NVHPC 21.7.

Serial CPU: 375 seconds
OpenACC V100 (-acc=gpu): 5.4 seconds

Compiler flags:
# OpenACC GPU
FFLAG = -acc -fast -mcmodel=medium -Minfo=accel
#Serial CPU
#FFLAG = -fast -mcmodel=medium -Minfo=accel

test_collapse.f90 (3.1 KB)


I typically use small nblocks for my application, generally in the range of 5-10. I am not able to understand why I am not getting parallel performance from the inner collapsed loop itself (over the i, j and k indices); it seems to have good potential (what do you think?). I typically use large values for NI, NJ and NK, and that's where my key parallelism lies. Let's say I am using NI = NJ = NK = 100; then NI x NJ x NK yields 10^6 independent operations available for parallelism. Why am I not getting performance with nblocks = 1 and NI(1) = NJ(1) = NK(1) = 100?

I am getting the following times with those settings (code attached):
CPU Serial: 35 secs
GPU parallel: 22.21 secs

Compiler flags:
CPU serial: -fast -mcmodel=medium -Minfo=accel
GPU parallel: -acc -fast -mcmodel=medium -Minfo=accel

But when NI x NJ x NK is much smaller (when NK = 1), the CPU performance looks better; I am confused why the GPU performance is lower there despite the large NI x NJ value.

Also, I would like to know: is splitting the parallelism into multiple levels a good practice in general?

Thanks!
Makefile (449 Bytes)
Module.f90 (739 Bytes)
test_collapse.f90 (3.1 KB)

Ok, since nblocks will be small, we'll want to take a different strategy.

CUDA, which our OpenACC implementation targets when using NVIDIA GPUs, organizes parallelism into blocks and threads, where the threads are grouped into a block. Blocks are run on the individual multiprocessors (SMs) on the device, with a max of 16 or 32 blocks executing per SM depending on the device. So if you're using a V100 with 80 SMs, this means you can run approximately 1280 concurrent blocks. (There are other factors involved; do a web search on CUDA occupancy for more details, but for now let's keep things simple.)

For OpenACC, we map a "gang" loop to the CUDA blocks and a "vector" loop to the CUDA threads. So in this case:

!$acc parallel loop collapse(2)   << implicitly using "gang"
DO nbl = 1,nblocks      
DO n_prim = 1,nprims                 << 1 nblock * 5 nprims, means only 5 of 1280 possible blocks can be used
!$acc loop collapse(3)     << implicit "vector" loop of 128 threads 
DO k = 1, NK(nbl)             << 1
DO j = 1, NJ(nbl)               << 300
DO i = 1, NI(nbl)                << 300   each thread will do ~700 iterations (300*300 / 128)

So basically, with only 5 gangs, you're severely under-utilizing the GPU. Even with nblocks = 10, that's only 50 blocks, which is better but still pretty small.

If you didn't have the bounds look-up arrays, I'd say the strategy would be to collapse the outer 4 loops, giving you 1500 blocks, and use an inner vector loop of 300 (you want the vector loop to match the stride-1 dimension of your arrays, which "i" does). But with the look-up arrays causing the dependency, going back to your first thought of not parallelizing the "nbl" loop would be the way to go. Though I'd add the "async" clause so that you can launch multiple, potentially overlapping kernels (note: put an "!$acc wait" just after the call to "Compute_Conservs" and before "CPU_TIME").

        integer queue
        DO nbl = 1,nblocks
           queue = nbl
        !$acc parallel loop gang collapse(3) async(queue)
        DO n_prim = 1,nprims
        DO k = 1, NK(nbl)
        DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)

                Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)

        ENDDO
        ENDDO
        ENDDO
        ENDDO
        ENDDO

        DO nbl = 1,nblocks
           queue = nblocks+nbl
        !$acc parallel loop collapse(3) async(queue)
        DO n_prim = 1,nprims
        DO k = 1, NK(nbl)
        DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)
                Py(i,j,k,nbl,n_prim) = j*k + Cy(i,j,k,nbl,1)*Cy(i,j,k,nbl,5) + Cy(i,j,k,nbl,3)
        ENDDO
        ENDDO
        ENDDO
        ENDDO
        ENDDO

        DO nbl = 1,nblocks
           queue = (2*nblocks)+nbl
        !$acc parallel loop collapse(3) async(queue)
        DO n_prim = 1,nprims
        DO k = 1, NK(nbl)
        DO j = 1, NJ(nbl)
        !$acc loop vector
        DO i = 1, NI(nbl)
                Pz(i,j,k,nbl,n_prim) = k*i + Cz(i,j,k,nbl,1)*Cz(i,j,k,nbl,5) + Cz(i,j,k,nbl,4)
        ENDDO
        ENDDO
        ENDDO
        ENDDO
        ENDDO
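
For reference, here is a minimal standalone sketch of the async/wait pattern itself (illustration only, not your attached code; the array name and sizes are made up): each block's kernel goes onto its own queue, and a single wait drains every queue before the host touches the data.

        ! Minimal sketch of async queues plus a single wait (placeholder names)
        program async_wait_sketch
          implicit none
          integer, parameter :: nblocks = 4, n = 256
          real :: Px(n, nblocks)
          integer :: nbl, i

          !$acc data create(Px)
          do nbl = 1, nblocks
             !$acc parallel loop async(nbl)      ! each block gets its own queue
             do i = 1, n
                Px(i, nbl) = real(i * nbl)
             end do
          end do
          !$acc wait                             ! drain every queue before timing/use
          !$acc update self(Px)
          !$acc end data

          print *, 'checksum:', sum(Px)
        end program async_wait_sketch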

Now, these kernels are really small with very little in them, so the performance characteristics may change when moving to your full application. But hopefully this gives you some ideas.

Another thought: if you can use the 'max' variables for the loop bounds and then put an if check inside to ensure the indices are within bounds, that would allow you to parallelize all five loops, or the outer 4 plus an inner vector loop. Not sure it will work for the full app, but maybe?

        !$acc parallel loop gang vector collapse(5) async
        DO nbl = 1,nblocks
        DO n_prim = 1,nprims
        DO k = 1, NKmax
        DO j = 1, NJmax
        DO i = 1, NImax
                if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
                Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
                endif

        ENDDO
        ENDDO
        ENDDO
        ENDDO
        ENDDO
        !$acc parallel loop gang collapse(4) async
        DO nbl = 1,nblocks
        DO n_prim = 1,nprims
        DO k = 1, NKmax
        DO j = 1, NJmax
        !$acc loop vector
        DO i = 1, NImax
                if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
                Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
                endif

        ENDDO
        ENDDO
        ENDDO
        ENDDO
        ENDDO

-Mat

Wow, Mat, that was really helpful and I have learnt some things today. Thanks for patiently putting in all the info.

I implemented your strategy (the first one, without the IFs in the loop) and I am able to see some performance improvement (up to 9x with the async clause and 1.5x without).

So by this logic, which of the following two codes is more efficient:

code1:

!$acc parallel loop gang collapse(2)
DO k = 1,100
DO i = 1,100
!$acc loop vector
DO j = 1,6
	
	<some operations>
		
ENDDO
ENDDO
ENDDO

code2:

!$acc parallel loop gang collapse(2)
DO k = 1,100
DO j = 1,6
!$acc loop vector
DO i = 1,100
	
	<some operations>
		
ENDDO
ENDDO
ENDDO

You want the 'vector' loop's index to access the stride-1 dimension of the arrays. So assuming "i" is accessing the first dimension of your Fortran arrays, you'll want the code 2 version.

However, by default the compiler will use a vector length of 128, so since the loop bound is 100, you'd be wasting some vectors. So in this case, I'd try adding a 'worker' level on the "j" loop. A worker is a grouping of vectors and maps to the y-dimension of a CUDA thread block (vector maps to the x-dimension).

So something like:

 !$acc parallel loop gang num_workers(6)
DO k = 1,100
!$acc loop worker
DO j = 1,6
!$acc loop vector
DO i = 1,100

By default, the compiler will use a block size of 32x4 (i.e. a vector length of 32 with 4 workers), but since J is 6, it might be better to set the number of workers to 6. Though 192 threads isn't evenly divisible into 2048 (the max number of threads that can run on an SM), so you'd be wasting a few cores on the SM if you were able to reach 100% occupancy. For lower occupancies, they would have been wasted anyway.
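
Filled in, that schedule might look like the following (a sketch with a placeholder array A and body; vector_length(32) is written out explicitly here only as an assumption, to match the 6 x 32 = 192 threads per block mentioned above):

        ! Sketch of the gang/worker/vector schedule (placeholder names)
        program worker_sketch
          implicit none
          real :: A(100, 6, 100)
          integer :: i, j, k

          !$acc parallel loop gang num_workers(6) vector_length(32) copyout(A)
          do k = 1, 100
             !$acc loop worker
             do j = 1, 6
                !$acc loop vector
                do i = 1, 100
                   A(i, j, k) = real(i + 10 * j + 100 * k)   ! stand-in for <some operations>
                end do
             end do
          end do

          print *, A(100, 6, 100)
        end program worker_sketch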

The caveat here is that now the number of gangs has been lowered from 600 to 100, so this might offset any gains from pushing more parallelism into the block itself.

In general, the optimal loop schedule is very dependent on the code and the loop trip counts. The good news is that OpenACC makes it very easy to experiment with different schedules. After porting, I'll usually spend an hour or so just trying different ones to see what happens. So that's what I'd recommend you do: just try different things and see how the performance changes.

I also like to use our runtime's simple profiler by setting the environment variable "NV_ACC_TIME=1". The caveat is that it implicitly disables async, so you'll just be looking at the kernel times and not the speed-up gained through concurrency. But that is what you want here anyway.


While experimenting with loop scheduling today I observed a few things, and I would like to know what's happening behind the scenes.

The following is my initial OpenACC loop schedule:

!$acc parallel loop gang collapse(3) default(present)
DO nbl = 1,nblocks   !---> nblocks = 1
DO k = 1, NKmax      !---> NKmax = 64 or 128 
DO j = 1, NJmax      !---> NJmax = 64 or 128
!$acc loop vector
DO i = 1, NImax      !---> NImax = 64 or 128

       Array(i,j,k,nbl) = <some code> !----> 'i' is the stride-1 index

When I use the above loop schedule with NImax = NJmax = NKmax = 64 it is all good and I see a 32x speedup. But when I changed to NImax = NJmax = NKmax = 128, I expected a higher speedup, yet it was only around 20x. Then I tried changing the loop schedule to the following configuration.

!$acc parallel loop gang vector collapse(4) default(present)
DO nbl = 1,nblocks   !---> nblocks = 1
DO k = 1, NKmax      !---> NKmax = 64 or 128 
DO j = 1, NJmax      !---> NJmax = 64 or 128
DO i = 1, NImax      !---> NImax = 64 or 128

       Array(i,j,k,nbl) = <some code> !----> 'i' is the stride-1 index

With the loop schedule above, the program became even slower. First of all, I want to know why this is happening. Secondly, I would like some tips on how to optimize the loop schedule as NImax, NJmax and NKmax change, so as to use the maximum number of threads possible.

What's the difference between the two code blocks? They look identical to me.

In any case, I probably can't answer what would account for the performance difference. I might be able to guess, but instead you should run both versions through Nsight Compute and compare the two profiles.

Sorry, my bad. I have edited the 2nd loop schedule. Can you have a look at it, please?

I am running the programs on a Linux cluster remotely from a Windows laptop. Is it still possible to profile using Nsight?

Ok, so in the second version the loops are collapsed and then the vectors are strip-mined (basically, think of it as if a loop is added with a trip count equal to the vector length, with each vector lane taking one iteration). I wouldn't think there would be much difference between the two, especially since the default vector length is 128, but maybe the second case is needing to use more registers.
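
If it helps, here is a rough host-side sketch of the strip-mining idea (illustration only, not what the compiler literally generates; sizes are made up): the linearized, collapsed iteration space is walked in chunks of the vector length, one lane per thread.

        program strip_mine_sketch
          implicit none
          integer, parameter :: ni = 128, nj = 128, nk = 128, veclen = 128
          integer :: total, chunk, lane, i, j, k
          real :: s

          s = 0.0
          total = ni * nj * nk
          do chunk = 1, total, veclen                         ! maps roughly to the gangs
             do lane = chunk, min(chunk + veclen - 1, total)  ! maps roughly to the vector lanes
                ! de-linearize "lane" back into (i, j, k), with i fastest
                i = mod(lane - 1, ni) + 1
                j = mod((lane - 1) / ni, nj) + 1
                k = (lane - 1) / (ni * nj) + 1
                s = s + real(i + j + k)                       ! stand-in for the loop body
             end do
          end do
          print *, s
        end program strip_mine_sketch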

I am running the programs on a Linux cluster remotely from a Windows laptop. Is it still possible to profile using Nsight?
Sure, this is how I use it. First use the command-line 'ncu' profiler to generate the profile, then copy it to a location that can be accessed by your laptop, and open it in the Nsight Compute GUI. You can also print the profile out to the command line, but I prefer the GUI view.

If you've not used Nsight Compute before, I'd suggest reviewing the online documentation:
https://docs.nvidia.com/nsight-compute/2021.2/index.html

Overview material, videos, and blogs are also available.


I have compared the profiles. Everything seems to be fine, but some kernels with more variables and more lines of code in them are running slower. Other kernels with the same number of total iterations and the same loop scheduling, but significantly fewer arrays, lines of code and scalar variables, are running much faster. I think I have to optimize those large kernels to get speed. Do you think eliminating arrays inside the compute regions and representing them as scalars would increase the speed?

Typically NImax, NJmax and NKmax in my app are in the range of 100 to 500, and the product NImax x NJmax x NKmax ranges from 1 million to 40 million. After seeing the performance of my app with NImax = NJmax = NKmax = 128 and the following loop schedule (only about 20x), I think the option of going to multi-GPU simulations with hybrid x64 + OpenACC parallelization is worth it. What do you think?

!$acc parallel loop gang collapse(3) default(present)
DO nbl = 1,nblocks   !---> nblocks = 1
DO k = 1, NKmax      
DO j = 1, NJmax     
!$acc loop vector
DO i = 1, NImax     

Probably the register usage is much higher, thus lowering the occupancy. You can see the number of registers used under the 'Occupancy' drop-down in the profile.

Again, I've been giving a simplistic view of occupancy, focusing only on the threads. Registers and shared memory are other shared, finite resources. For 100% occupancy, each thread can only use 32 registers (a V100 SM has 65,536 registers shared among up to 2,048 resident threads). Though lower occupancy does not necessarily mean lower performance. Often for HPC applications 50% occupancy is quite good, and I've seen good performance even at 25%.

I think I have to optimize those large kernels to get speed. Do you think eliminating arrays inside the compute regions and representing them as scalars would increase the speed?

Maybe? The private arrays will be stored in global memory so they need to be fetched, though hopefully they will then be held in the L2 cache. Making them scalars, they'll be stored in registers, which is faster, but now you've added more registers, which could lower your occupancy.

Note that the only way to reduce registers is to use fewer local variables. So try to eliminate and reuse scalars if possible.
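
For what it's worth, here is a tiny contrived sketch (names made up, not from your code) contrasting a small private array with a reused scalar inside a kernel:

        program private_vs_scalar
          implicit none
          integer, parameter :: n = 1024
          real :: a(n), b(n), c(n), tmp(2), t
          integer :: i

          a = 1.0; b = 2.0

          ! Private-array form: each thread gets its own copy of tmp,
          ! which lives in (cached) global memory.
          !$acc parallel loop private(tmp) copyin(a, b) copyout(c)
          do i = 1, n
             tmp(1) = a(i) * b(i)
             tmp(2) = a(i) + b(i)
             c(i)   = tmp(1) - tmp(2)
          end do

          ! Scalar form: one reused temporary held in a register
          ! (faster, but registers are a finite resource).
          !$acc parallel loop copyin(a, b) copyout(c)
          do i = 1, n
             t    = a(i) * b(i)
             c(i) = t - (a(i) + b(i))
          end do

          print *, c(1)
        end program private_vs_scalar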

I think the option of going to multi-GPU simulations with hybrid x64 + OpenACC parallelization is worth it. What do you think?

Most of the codes I work with are hybrid MPI+OpenACC and use multiple GPUs, so yes, it's worth it. Though I've not found it beneficial to mix CPU threading with multi-GPU, e.g. compiling a unified multicore + GPU OpenACC program and then having one rank go parallel across the CPU cores while the others offload to the GPUs. It can be done, but you run into a load-balancing problem where the application gets bottlenecked waiting on the CPU unless it's given less work. The logic to do the load balancing often gets very complex and is very system dependent. I've not found it worth the effort, so I end up just using the GPUs.
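
In case it's useful, here is a minimal sketch (not from your code) of the usual MPI+OpenACC device-binding step, where each rank selects a GPU based on its rank number before offloading its blocks:

        program mpi_acc_binding
          use mpi
          use openacc
          implicit none
          integer :: ierr, rank, ngpus, dev

          call MPI_Init(ierr)
          call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

          ! Round-robin the ranks onto the available GPUs (in practice a
          ! node-local rank is often used instead of the global rank).
          ngpus = acc_get_num_devices(acc_device_nvidia)
          if (ngpus > 0) then
             dev = mod(rank, ngpus)
             call acc_set_device_num(dev, acc_device_nvidia)
          end if

          ! ... each rank then offloads its own blocks to its own GPU ...

          call MPI_Finalize(ierr)
        end program mpi_acc_binding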

I think that's the best way to do it. I do see a performance gain from eliminating the private arrays and cutting down the scalars. I just saw a mind-blowing 80x speedup by doing that (with NI = NJ = NK = 128).

I am also planning to use the MPI+OpenACC paradigm because of the high grid resolutions my app will have to handle.
