Hello,

I am parallelizing a code with DO CONCURRENT and trying to find the correct index order, both for allocating the arrays and for the DO CONCURRENT loop header. For now, I am running with nblk = 1. Here is an example loop from the code:

```
allocate (flux_t(nblk,nt,npm))
do concurrent (i=1:nblk,k=1:npm,j=3:nt-2)
!
  p0m = (one + D_C_MCt(j-1))*LN(i,j-1,k) - D_C_MCt(j-1)*LN(i,j-2,k)
  p1m =        D_C_MCt(j  ) *LN(i,j-1,k) + D_C_CPt(j-1)*LN(i,j  ,k)
  p0p = (one + D_C_CPt(j  ))*LP(i,j  ,k) - D_C_CPt(j  )*LP(i,j+1,k)
  p1p =        D_C_CPt(j-1) *LP(i,j  ,k) + D_C_MCt(j  )*LP(i,j-1,k)
!
  B0m = four*(D_C_MCt(j-1)*(LN(i,j-1,k) - LN(i,j-2,k)))**2
  B1m = four*(D_C_CPt(j-1)*(LN(i,j  ,k) - LN(i,j-1,k)))**2
  B0p = four*(D_C_CPt(j  )*(LP(i,j+1,k) - LP(i,j  ,k)))**2
  B1p = four*(D_C_MCt(j  )*(LP(i,j  ,k) - LP(i,j-1,k)))**2
!
  w0m = D_P_Tt(j-1) *(one/(weno_eps + B0m)**2)
  w1m = D_MC_Tt(j-1)*(one/(weno_eps + B1m)**2)
  w0p = D_M_Tt(j)   *(one/(weno_eps + B0p)**2)
  w1p = D_CP_Tt(j)  *(one/(weno_eps + B1p)**2)
!
  wm_sum = w0m + w1m
  wp_sum = w0p + w1p
!
  OM0m = w0m/wm_sum
  OM1m = w1m/wm_sum
  OM0p = w0p/wp_sum
  OM1p = w1p/wp_sum
!
  um = OM0m*p0m + OM1m*p1m
  up = OM0p*p0p + OM1p*p1p
!
  flux_t(i,j,k) = up + um
!
enddo
```
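
In case it helps with reproducing this, here is a minimal, self-contained sketch of the kind of comparison I mean. It is not the POT3D loop: the array names (`a`, `f1`, `f2`), the sizes, and the simple averaging stencil are placeholders made up for illustration. It runs the same update with two different index orders in the DO CONCURRENT header and checks that the results agree, so any difference between header orders should be a matter of performance only.

```fortran
program dc_header_order_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Illustrative sizes only (assumption); not the POT3D grid.
  integer, parameter :: nblk = 1, nt = 64, npm = 32
  real(dp), allocatable :: a(:,:,:), f1(:,:,:), f2(:,:,:)
  integer :: i, j, k

  ! Layout (nt,npm,nblk): j is the leftmost, contiguous index.
  allocate (a(nt,npm,nblk), f1(nt,npm,nblk), f2(nt,npm,nblk))
  call random_number(a)
  f1 = 0.0_dp
  f2 = 0.0_dp

  ! Header order (nblk, np, nt): contiguous index j listed last.
  do concurrent (i=1:nblk, k=1:npm, j=3:nt-2)
    f1(j,k,i) = 0.5_dp*(a(j-1,k,i) + a(j+1,k,i))
  end do

  ! Header order (nt, np, nblk): contiguous index j listed first.
  do concurrent (j=3:nt-2, k=1:npm, i=1:nblk)
    f2(j,k,i) = 0.5_dp*(a(j-1,k,i) + a(j+1,k,i))
  end do

  ! Iterations are independent, so both orders must give the same values.
  if (all(f1 == f2)) then
    print '(a)', 'header orders agree'
  else
    print '(a)', 'header orders DIFFER'
  end if
end program dc_header_order_sketch
```

To get meaningful timings from something like this, the sizes would need to be much larger and each loop wrapped in `system_clock` calls.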

I looked at two ways of allocating the arrays, (nblk,nt,npm) and (nt,npm,nblk), and for each one I tried every ordering of the indices in the DO CONCURRENT header. I find that changing the index order within the DO CONCURRENT has a drastic effect on performance, so I am trying to figure out which order gives the correct memory stride. Here are some of my timings, in seconds, for the entire POT3D code, with the loops written consistently throughout the code:

On the CPU in serial with nblk = 1:

```
******* Allocate (nblk,nt,npm) ####################
Triple nested do loop (nest order : npm, nt, nblk) : 465.57
DO CONCURRENT (np, nt, nblk) : 421.21
DO CONCURRENT (nblk, np, nt) : 210.69
DO CONCURRENT (nblk, nt, np) : 610.36
DO CONCURRENT (nt, np, nblk) : 698.40
DO CONCURRENT (nt, nblk, np) : 615.77
DO CONCURRENT (np, nblk, nt) : 214.16
******* Allocate (nt,npm,nblk) ####################
Triple nested do loop (nest order : nblk, npm, nt) : 170.41
DO CONCURRENT (np, nt, nblk) : 344.41
DO CONCURRENT (nblk, np, nt) : 178.87
DO CONCURRENT (nblk, nt, np) : 587.08
DO CONCURRENT (nt, np, nblk) : 654.52
DO CONCURRENT (nt, nblk, np) : 599.88
DO CONCURRENT (np, nblk, nt) : 180.46
```

On the GPU with nblk = 1:

```
******* Allocate (nblk,nt,npm) ####################
DO CONCURRENT (np, nt, nblk) : 9.57
DO CONCURRENT (nblk, np, nt) : 9.53
DO CONCURRENT (nblk, nt, np) : 11.02
DO CONCURRENT (nt, np, nblk) : 11.22
DO CONCURRENT (nt, nblk, np) : 11.18
DO CONCURRENT (np, nblk, nt) : 9.55
******* Allocate (nt,npm,nblk) ####################
DO CONCURRENT (np, nt, nblk) : 9.39
DO CONCURRENT (nblk, np, nt) : 9.40
DO CONCURRENT (nblk, nt, np) : 9.39
DO CONCURRENT (nt, np, nblk) : 9.48
DO CONCURRENT (nt, nblk, np) : 9.41
DO CONCURRENT (np, nblk, nt) : 9.36
```

I also checked whether the same thing happens with gfortran on the CPU in serial. There, with the (nblk,nt,npm) allocation the timing is always roughly 360 s regardless of the DO CONCURRENT order, and with the (nt,npm,nblk) allocation it is always roughly 189 s regardless of the order.
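
For reference, this is how I understand the two layouts. Fortran stores arrays in column-major order, so the leftmost index is the one that is contiguous in memory. The sketch below uses small made-up sizes and placeholder array names (`a` and `b`); it tags each element with the offset that column-major order predicts and then checks that prediction against the actual storage order via `reshape`. It also prints the stride of j in each layout, which is nblk elements for (nblk,nt,npm) and 1 element for (nt,npm,nblk), so with nblk = 1 consecutive j values sit next to each other in memory in both allocations.

```fortran
program layout_sketch
  implicit none
  ! Small illustrative sizes (assumption); nblk = 1 as in my runs.
  integer, parameter :: nblk = 1, nt = 6, npm = 4
  integer :: a(nblk,nt,npm), b(nt,npm,nblk)
  integer :: i, j, k, n
  logical :: ok

  ! Tag each element with its predicted column-major offset.
  do k = 1, npm
    do j = 1, nt
      do i = 1, nblk
        a(i,j,k) = (i-1) + nblk*(j-1) + nblk*nt*(k-1)
        b(j,k,i) = (j-1) + nt*(k-1) + nt*npm*(i-1)
      end do
    end do
  end do

  ! reshape walks the array in storage order, so a correct
  ! prediction yields 0, 1, 2, ... for both layouts.
  n = nblk*nt*npm
  ok = all(reshape(a, [n]) == [(i, i = 0, n-1)]) .and. &
       all(reshape(b, [n]) == [(i, i = 0, n-1)])

  if (ok) then
    print '(a)', 'layout check passed'
  else
    print '(a)', 'layout check FAILED'
  end if
  print '(a,i0)', 'j stride, allocate(nblk,nt,npm): ', nblk
  print '(a,i0)', 'j stride, allocate(nt,npm,nblk): ', 1
end program layout_sketch
```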

Any help or guidance on the correct memory stride with DO CONCURRENT would be appreciated. I find it interesting that the GPU timings are fairly consistent, while the serial CPU timings vary wildly.

- Miko