DO CONCURRENT loop order

Hello,

I am trying to parallelize a code with DO CONCURRENT, and I am trying to find the right order for allocating the arrays and for listing the indices in the DO CONCURRENT statement. For now, nblk = 1. Here is an example loop from the code:

     allocate (flux_t(nblk,nt,npm))
     do concurrent (i=1:nblk,k=1:npm,j=3:nt-2)
!
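!        Candidate stencil values for the minus (LN) and plus (LP) states.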
        p0m = (one + D_C_MCt(j-1))*LN(i,j-1,k) - D_C_MCt(j-1)*LN(i,j-2,k)
        p1m =        D_C_MCt(j  ) *LN(i,j-1,k) + D_C_CPt(j-1)*LN(i,j  ,k)
        p0p = (one + D_C_CPt(j  ))*LP(i,j  ,k) - D_C_CPt(j  )*LP(i,j+1,k)
        p1p =        D_C_CPt(j-1) *LP(i,j  ,k) + D_C_MCt(j  )*LP(i,j-1,k)
!
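!        Smoothness indicators: squared, scaled differences of neighbouring values.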
        B0m = four*(D_C_MCt(j-1)*(LN(i,j-1,k) - LN(i,j-2,k)))**2
        B1m = four*(D_C_CPt(j-1)*(LN(i,j  ,k) - LN(i,j-1,k)))**2
        B0p = four*(D_C_CPt(j  )*(LP(i,j+1,k) - LP(i,j  ,k)))**2
        B1p = four*(D_C_MCt(j  )*(LP(i,j  ,k) - LP(i,j-1,k)))**2
!
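!        Un-normalized WENO-style weights; weno_eps guards against division by zero.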
        w0m = D_P_Tt(j-1) *(one/(weno_eps + B0m)**2)
        w1m = D_MC_Tt(j-1)*(one/(weno_eps + B1m)**2)
        w0p = D_M_Tt(j) *(one/(weno_eps + B0p)**2)
        w1p = D_CP_Tt(j)*(one/(weno_eps + B1p)**2)
!
        wm_sum = w0m + w1m
        wp_sum = w0p + w1p
!
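!        Normalize the weights.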
        OM0m = w0m/wm_sum
        OM1m = w1m/wm_sum
        OM0p = w0p/wp_sum
        OM1p = w1p/wp_sum
!
        um = OM0m*p0m + OM1m*p1m
        up = OM0p*p0p + OM1p*p1p
!
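!        Store the sum of the two reconstructed values in the flux array.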
        flux_t(i,j,k) = up + um
!
      enddo

I looked at two ways of allocating the arrays, (nblk,nt,npm) and (nt,npm,nblk), and I tried all the orderings of the loop indices. I find that changing the index order within the DO CONCURRENT has a drastic effect on performance, and I am trying to figure out the correct memory stride. Here are some of my timings in seconds for the entire POT3D code, with consistent loop orders used throughout the code:

On the CPU in serial with nblk = 1:

******* Allocate (nblk,nt,npm) ####################

Triple nested do loop (nest order: npm, nt, nblk)  :  465.57
DO CONCURRENT (np, nt, nblk)                       :  421.21
DO CONCURRENT (nblk, np, nt)                       :  210.69
DO CONCURRENT (nblk, nt, np)                       :  610.36
DO CONCURRENT (nt, np, nblk)                       :  698.40
DO CONCURRENT (nt, nblk, np)                       :  615.77
DO CONCURRENT (np, nblk, nt)                       :  214.16

******* Allocate (nt,npm,nblk) ####################

Triple nested do loop (nest order: nblk, npm, nt)  :  170.41
DO CONCURRENT (np, nt, nblk)                       :  344.41
DO CONCURRENT (nblk, np, nt)                       :  178.87
DO CONCURRENT (nblk, nt, np)                       :  587.08
DO CONCURRENT (nt, np, nblk)                       :  654.52
DO CONCURRENT (nt, nblk, np)                       :  599.88
DO CONCURRENT (np, nblk, nt)                       :  180.46

On the GPU with nblk = 1:

******* Allocate (nblk,nt,npm) ####################

DO CONCURRENT (np, nt, nblk)  :  9.57
DO CONCURRENT (nblk, np, nt)  :  9.53
DO CONCURRENT (nblk, nt, np)  :  11.02
DO CONCURRENT (nt, np, nblk)  :  11.22
DO CONCURRENT (nt, nblk, np)  :  11.18
DO CONCURRENT (np, nblk, nt)  :  9.55

******* Allocate (nt,npm,nblk) ####################

DO CONCURRENT (np, nt, nblk)  :  9.39
DO CONCURRENT (nblk, np, nt)  :  9.40
DO CONCURRENT (nblk, nt, np)  :  9.39
DO CONCURRENT (nt, np, nblk)  :  9.48
DO CONCURRENT (nt, nblk, np)  :  9.41
DO CONCURRENT (np, nblk, nt)  :  9.36

I also checked whether the same thing happens with gfortran on the CPU in serial. There, when I allocate (nblk,nt,npm) the timing is always roughly 360 s regardless of the DO CONCURRENT index order, and when I allocate (nt,npm,nblk) it is always roughly 189 s, again regardless of the index order.

Any help or guidance on the correct memory stride with DO CONCURRENT would be appreciated. I find it interesting that the GPU timings are fairly consistent, while the serial CPU timings vary wildly.

- Miko

Hi Miko,

Could you send me this new version of POT3D so I can do a better analysis? For now I’ll speculate, with the caveat that I haven’t done much work with DO CONCURRENT targeting multi-core CPUs.

From the GPU perspective, given that LN and LP are strided as “(i,j,k)”, I’d think the ideal would be “do concurrent (k=…, j=…, i=…)”. Though, like OpenACC’s “kernels” directive, the compiler has the freedom to interchange loops, which is likely why you see less variance between the schedules.
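
As a sketch (same bounds and body as your posted loop), that ordering would be:

      do concurrent (k=1:npm, j=3:nt-2, i=1:nblk)
         ! ... same reconstruction body as above ...
         flux_t(i,j,k) = up + um
      enddo

i.e., the stride-1 index i listed last.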

Given nblk is 1, it makes sense that using it in the stride-1 dimension gives worse performance. Unless it’s a parameter, the compiler won’t know nblk is 1 and hence will likely schedule that index across the vectors (CUDA threads).
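
As a minimal sketch of that point: if the block count really is fixed, declaring it as a named constant exposes the value to the compiler, while a plain variable set at run time does not:

      ! With a named constant the compiler knows nblk == 1;
      ! with a plain "integer :: nblk" it has to assume any value.
      integer, parameter :: nblk = 1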

For the CPU side, I don’t think the compiler is able to perform the loop reordering, so it effectively does a straight OpenMP collapse(3). I’m not entirely sure, but given your data it appears that the data blocking is done from the inside out, which is why putting np before nt gives better results. Given nblk = 1, its placement has less effect. CPU vectorization may be coming into play as well if the memory blocks are contiguous.
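
For the CPU, the usual column-major rule of thumb is to keep the leftmost (contiguous) array index in the innermost loop. A minimal sketch of the equivalent ordinary loop nest, assuming the (nt,npm,nblk) allocation from your second set of timings (LN, LP, and flux_t re-indexed to match):

      ! Sketch only: the contiguous index j (over nt) is innermost, so
      ! LN, LP, and flux_t are walked through memory with unit stride.
      allocate (flux_t(nt,npm,nblk))
      do i=1,nblk
        do k=1,npm
          do j=3,nt-2
            ! ... same body as above, with LN, LP, flux_t indexed as (j,k,i) ...
            flux_t(j,k,i) = up + um
          enddo
        enddo
      enddo

That corresponds to the fastest serial CPU case in your table (the triple nested do loop with the (nt,npm,nblk) allocation).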

Again, if you can send me the code, I can do a better analysis, since I haven’t looked at this code before. Plus, I can then pass it on to the compiler team to look for performance opportunities.

-Mat

Hello Mat,

Sorry, I misspoke. It’s not a version of POT3D. I will have to check first whether I am allowed to share the code. In the meantime, this does help my understanding.

Best,
Miko

Sorry to bump this issue. I was also wondering about this.

In one of the NVIDIA blog posts (https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk), the opposite order is used:

      do concurrent(i=2 : n-1, j=2 : m-1)
        a(i,j) = w0 * b(i,j) + &
                 w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
                 w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
      enddo

I noticed that one gets slightly different optimization information (-Minfo=stdpar) for the two main index orderings.

For a Fortran array indexed as (i,j,k), when I have a triple loop like this,

      do concurrent(k=1:8,j=1:ny,i=1:nx)
! ...
      end do

I get the message,

     88, Generating NVIDIA GPU code
         88,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x

But when I re-order the loop as i=1:nx,j=1:ny,k=1:8, the message changes to,

     88, Generating NVIDIA GPU code
         88,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost

In practice, just like Miko, I see no noticeable difference in GPU performance.

The order does matter on the CPU when using other compilers (see, for instance, Intel).