DO CONCURRENT loop order

Hello,

I am trying to parallelize a code with DO CONCURRENT, and I am trying to find the right order for allocating the arrays and for listing the indices in the DO CONCURRENT statement. For now, nblk = 1. Here is an example loop from the code:

     allocate (flux_t(nblk,nt,npm))
     do concurrent (i=1:nblk,k=1:npm,j=3:nt-2)
!
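!        Candidate stencil values for the minus (LN) and plus (LP) states.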
        p0m = (one + D_C_MCt(j-1))*LN(i,j-1,k) - D_C_MCt(j-1)*LN(i,j-2,k)
        p1m =        D_C_MCt(j  ) *LN(i,j-1,k) + D_C_CPt(j-1)*LN(i,j  ,k)
        p0p = (one + D_C_CPt(j  ))*LP(i,j  ,k) - D_C_CPt(j  )*LP(i,j+1,k)
        p1p =        D_C_CPt(j-1) *LP(i,j  ,k) + D_C_MCt(j  )*LP(i,j-1,k)
!
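!        Smoothness indicators: squared, scaled differences of neighbouring values.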
        B0m = four*(D_C_MCt(j-1)*(LN(i,j-1,k) - LN(i,j-2,k)))**2
        B1m = four*(D_C_CPt(j-1)*(LN(i,j  ,k) - LN(i,j-1,k)))**2
        B0p = four*(D_C_CPt(j  )*(LP(i,j+1,k) - LP(i,j  ,k)))**2
        B1p = four*(D_C_MCt(j  )*(LP(i,j  ,k) - LP(i,j-1,k)))**2
!
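!        Un-normalized WENO-style weights; weno_eps guards against division by zero.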
        w0m = D_P_Tt(j-1) *(one/(weno_eps + B0m)**2)
        w1m = D_MC_Tt(j-1)*(one/(weno_eps + B1m)**2)
        w0p = D_M_Tt(j) *(one/(weno_eps + B0p)**2)
        w1p = D_CP_Tt(j)*(one/(weno_eps + B1p)**2)
!
        wm_sum = w0m + w1m
        wp_sum = w0p + w1p
!
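!        Normalize the weights.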
        OM0m = w0m/wm_sum
        OM1m = w1m/wm_sum
        OM0p = w0p/wp_sum
        OM1p = w1p/wp_sum
!
        um = OM0m*p0m + OM1m*p1m
        up = OM0p*p0p + OM1p*p1p
!
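!        Store the sum of the two reconstructed values in the flux array.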
        flux_t(i,j,k) = up + um
!
      enddo

I looked at two ways of allocating the arrays, (nblk,nt,npm) and (nt,npm,nblk), and I tried all the orderings of the loop indices. I find that changing the index order within the DO CONCURRENT has a drastic effect on performance, and I am trying to figure out the correct memory stride. Here are some of my timings in seconds for the entire POT3D code, with consistent loop orders used throughout the code:

On the CPU in serial with nblk = 1:

******* Allocate (nblk,nt,npm) ####################

Triple nested do loop (nest order: npm, nt, nblk)  :  465.57
DO CONCURRENT (np, nt, nblk)                       :  421.21
DO CONCURRENT (nblk, np, nt)                       :  210.69
DO CONCURRENT (nblk, nt, np)                       :  610.36
DO CONCURRENT (nt, np, nblk)                       :  698.40
DO CONCURRENT (nt, nblk, np)                       :  615.77
DO CONCURRENT (np, nblk, nt)                       :  214.16

******* Allocate (nt,npm,nblk) ####################

Triple nested do loop (nest order: nblk, npm, nt)  :  170.41
DO CONCURRENT (np, nt, nblk)                       :  344.41
DO CONCURRENT (nblk, np, nt)                       :  178.87
DO CONCURRENT (nblk, nt, np)                       :  587.08
DO CONCURRENT (nt, np, nblk)                       :  654.52
DO CONCURRENT (nt, nblk, np)                       :  599.88
DO CONCURRENT (np, nblk, nt)                       :  180.46

On the GPU with nblk = 1:

******* Allocate (nblk,nt,npm) ####################

DO CONCURRENT (np, nt, nblk)  :  9.57
DO CONCURRENT (nblk, np, nt)  :  9.53
DO CONCURRENT (nblk, nt, np)  :  11.02
DO CONCURRENT (nt, np, nblk)  :  11.22
DO CONCURRENT (nt, nblk, np)  :  11.18
DO CONCURRENT (np, nblk, nt)  :  9.55

******* Allocate (nt,npm,nblk) ####################

DO CONCURRENT (np, nt, nblk)  :  9.39
DO CONCURRENT (nblk, np, nt)  :  9.40
DO CONCURRENT (nblk, nt, np)  :  9.39
DO CONCURRENT (nt, np, nblk)  :  9.48
DO CONCURRENT (nt, nblk, np)  :  9.41
DO CONCURRENT (np, nblk, nt)  :  9.36

I also checked whether the same thing happens with gfortran on the CPU in serial. There, when I allocate (nblk,nt,npm) the timing is always roughly 360 s regardless of the DO CONCURRENT index order, and when I allocate (nt,npm,nblk) it is always roughly 189 s, again regardless of the index order.

Any help or guidance on the correct memory stride with DO CONCURRENT would be appreciated. I find it interesting that the GPU timings are fairly consistent, while the serial CPU timings vary wildly.

- Miko

Hi Miko,

Could you send me this new version of POT3D so I can do a better analysis? For now I’ll speculate, with the caveat that I haven’t done much work with DO CONCURRENT targeting multi-core CPUs.

From the GPU perspective, given that LN and LP are strided as “(i,j,k)”, I’d think the ideal would be “do concurrent (k=…, j=…, i=…)”. Though, like OpenACC’s “kernels” directive, the compiler has the freedom to interchange loops, which is likely why you see less variance between the schedules.
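
As a sketch (same bounds and body as your posted loop), that ordering would be:

      do concurrent (k=1:npm, j=3:nt-2, i=1:nblk)
         ! ... same reconstruction body as above ...
         flux_t(i,j,k) = up + um
      enddo

i.e., the stride-1 index i listed last.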

Given nblk is 1, it makes sense that using it in the stride-1 dimension gives worse performance. Unless it’s a parameter, the compiler won’t know nblk is 1 and hence will likely schedule that index across the vectors (CUDA threads).
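
As a minimal sketch of that point: if the block count really is fixed, declaring it as a named constant exposes the value to the compiler, while a plain variable set at run time does not:

      ! With a named constant the compiler knows nblk == 1;
      ! with a plain "integer :: nblk" it has to assume any value.
      integer, parameter :: nblk = 1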

For the CPU side, I don’t think the compiler is able to perform the loop reordering, so it effectively does a straight OpenMP collapse(3). I’m not entirely sure, but given your data it appears that the data blocking is done from the inside out, which is why putting np before nt gives better results. Given nblk = 1, its placement has less effect. CPU vectorization may be coming into play as well if the memory blocks are contiguous.
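
For the CPU, the usual column-major rule of thumb is to keep the leftmost (contiguous) array index in the innermost loop. A minimal sketch of the equivalent ordinary loop nest, assuming the (nt,npm,nblk) allocation from your second set of timings (LN, LP, and flux_t re-indexed to match):

      ! Sketch only: the contiguous index j (over nt) is innermost, so
      ! LN, LP, and flux_t are walked through memory with unit stride.
      allocate (flux_t(nt,npm,nblk))
      do i=1,nblk
        do k=1,npm
          do j=3,nt-2
            ! ... same body as above, with LN, LP, flux_t indexed as (j,k,i) ...
            flux_t(j,k,i) = up + um
          enddo
        enddo
      enddo

That corresponds to the fastest serial CPU case in your table (the triple nested do loop with the (nt,npm,nblk) allocation).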

Again, if you can send me the code, I can do a better analysis, since I haven’t looked at this code before. Plus, I can then pass it on to the compiler team to look for performance opportunities.

-Mat

Hello Mat,

Sorry, I misspoke. It’s not a version of POT3D. I will have to check first whether I am allowed to share the code. In the meantime, this does help my understanding.

Best,
Miko

Sorry to bump this issue. I was also wondering about this.

In one of the NVIDIA blog posts (https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk), the opposite order is used:

      do concurrent(i=2 : n-1, j=2 : m-1)
        a(i,j) = w0 * b(i,j) + &
                 w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
                 w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
      enddo

I noticed that one gets slightly different optimization information (-Minfo=stdpar) for the two main index orderings.

For a Fortran array indexed as (i,j,k), when I have a triple loop like this,

      do concurrent(k=1:8,j=1:ny,i=1:nx)
! ...
      end do

I get the message,

     88, Generating NVIDIA GPU code
         88,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x

But when I re-order the loop as i=1:nx,j=1:ny,k=1:8, the message changes to,

     88, Generating NVIDIA GPU code
         88,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost

In practice, just like Miko, I see no noticeable difference in GPU performance.

The order does matter on the CPU when using other compilers (see, for instance, Intel).