Construct and clauses in a deeply nested loop

I have the following 4-level nested loop:

do i = 1, M
  do j = 1, N
    do k = 1, O
      do l = 1, P

        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp

      enddo
    enddo
  enddo
enddo

Now,

  • M, N are < 10
  • O, P, Q are 128
  • we need temp to be private.

So I did this:

!$acc parallel copyout(out) private(temp)

!$acc loop seq
do i = 1, M
  !$acc loop seq
  do j = 1, N
    !$acc loop vector
    do k = 1, O
      !$acc loop
      do l = 1, P

        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp

      enddo
      !$acc end loop
    enddo
    !$acc end loop
  enddo
  !$acc end loop
enddo
!$acc end loop

!$acc end parallel

After some effort, I am getting correct results and some speedup, but
I have some basic questions:

(1) How is the private clause on the parallel construct correct? It
just replicates temp across the gangs, doesn’t it?

(2) If I put the private clause on the loop construct before the
k-loop, I get incorrect results.
Shouldn’t that give the correct result and (1) give incorrect results?

(3) I notice (from -Minfo=all) that the compiler adds a loop gang for
the l-loop.
82, Generating Tesla code
84, !$acc loop seq
86, !$acc loop seq
88, !$acc loop vector(128) ! threadidx%x
90, !$acc loop gang ! blockidx%x

Okay fine. But if I reverse the two (88, 90):
- have a loop gang for the k-loop
- have a loop vector for the l-loop,
I get an incorrect answer. Why?

(4) Finally, what is the best way to place the loop constructs, given
the values of M, N, O, P?
The best I am getting is only a 3x speedup.

(5) Is the 5-D array for “out” an issue of concern for speedup?

I can send the actual code (only the innermost body is big) and the
pgfortran output, if the behavior in (1)-(3) is not expected for this
code snippet.

Thanks,
arun

(1) How is the private clause on the parallel construct correct? It
just replicates temp across the gangs, doesn’t it?

Correct. A private clause on the parallel construct (as opposed to a loop construct) applies to the gang loop(s). That is most likely why the compiler is scheduling the “l” loop as gang, since “temp” needs to be private to “l”.

(2) If I put the private clause on the loop construct before the
k-loop, I get incorrect results.

Most likely because “temp” needs to be private to the “l” loop.

Shouldn’t that give the correct result and (1) give incorrect results?

No, again the compiler is making “l” the gang loop, hence “temp” is privatized correctly.

(3) I notice (from -Minfo=all) that the compiler adds a loop gang for
the l-loop. … But if I reverse the two (88, 90) I get an incorrect answer. Why?

Most likely the same reason: “temp” needs to be private to the “l” loop. You’d want to move the “private” clause to the “l” loop’s “loop vector” directive instead of having it on the parallel construct.
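
For illustration, here is a minimal sketch of that placement under the reversed schedule (gang on the “k” loop, vector on the “l” loop), reusing the loop bounds and body from the snippet above:

!$acc parallel copyout(out)
!$acc loop seq
do i = 1, M
  !$acc loop seq
  do j = 1, N
    !$acc loop gang
    do k = 1, O
      ! private here gives each vector lane its own copy of temp
      !$acc loop vector private(temp)
      do l = 1, P
        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp
      enddo
    enddo
  enddo
enddo
!$acc end parallel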

(5) Is the 5-D array for “out” an issue of concern for speedup?

Since Fortran arrays are contiguous in memory, the number of dimensions doesn’t matter. But in general, it’s best to manage arrays via an unstructured data region at higher levels in the code and then use the “update” directive to synchronize data movement. As you offload more of the code to the device, you can then start to minimize the data movement.
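
As a hedged sketch of that pattern (the exact placement points are hypothetical and depend on where “out” is allocated and consumed in the application):

! once, after out is allocated on the host
!$acc enter data create(out)

! … offloaded compute regions, such as the parallel region above, work on the device copy …

! copy results back only at the points where the host actually needs them
!$acc update self(out)

! once, before out is deallocated
!$acc exit data delete(out)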

I can send the actual code (only the innermost body is big)

It would be helpful so I can give you better advice as to what loop schedule would be optimal.

Though in general, you want the “vector” loop to run over the contiguous (stride-1) dimension. In Fortran, this is the first dimension. Hence, given just this snippet, I’d try collapsing the outer 4 loops using a gang schedule and then let the compiler schedule the implicit array-syntax loops as “vector” (or turn the array syntax into explicit loops and manually add the “loop vector” directive). Plus, “temp” will most likely be put into shared memory, assuming the size of “temp” is known and isn’t too big.

Something like:

!$acc parallel loop gang collapse(4) private(temp)
do i = 1, M
  do j = 1, N
    do k = 1, O
      do l = 1, P
        …
        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp
        …
The caveat being that I don’t know what other code is in the “l” loop, and the non-array-syntax lines will be executed by a single vector. Plus, a barrier will be added between vector loops, so if there are many of them, this may cause a slow-down.

Are you able to change the layout of “out”? If so, then you may be better off with something like:

!$acc parallel loop gang collapse(3)
do i = 1, M
  do j = 1, N
    do k = 1, O
      !$acc loop vector private(temp)
      do l = 1, P
        …
        temp(1:Q) = …
        out(l, 1:Q, k, j, i) = temp
        …

-Mat

Thank you Mat. With your explanation, (1), (2), and (3) are resolved.
Basically you are saying that although “private” is on the parallel construct, it applies to the gang loops also.

Please find attached the code file and the pgfortran output.
nested_loop_nvidia.txt (3.6 KB)
pgfortra_output_nvidia.txt (2.2 KB)

I am afraid you cannot run it, as it is part of a multi-file project. Nevertheless, it should give you the idea.
I have included the required variable declarations at the top.
Also, line 84 in the pgfortran output corresponds to “do prim=1,nvars” in the code file.
I didn’t understand the loop-dependence warning for lines 84, 86.

Thanks,
arun

I wouldn’t say “also”, but rather that private, when added to a “parallel” construct, only applies to the gangs. Here’s the language from the OpenACC 3.0 spec:

969 2.5.11. private clause
970 The private clause is allowed on the parallel and serial constructs; it declares that a copy
971 of each item on the list will be created for each gang.
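
Put differently, to get a per-thread (per vector lane) copy, the private clause has to go on the loop that carries the vector parallelism. A small illustrative fragment, where “gtmp” and “vtmp” are hypothetical arrays used only to mark where each copy lives:

! one copy of gtmp per gang, shared by all vector lanes of that gang
!$acc parallel private(gtmp)
!$acc loop gang
do k = 1, O
  ! one copy of vtmp per vector lane (thread)
  !$acc loop vector private(vtmp)
  do l = 1, P
    vtmp(1:Q) = real(l)   ! safe: every lane writes its own vtmp
    ! writing gtmp here would race, since the lanes of a gang share one copy
  enddo
enddo
!$acc end parallel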

I didn’t understand the loop-dependence warning for lines 84, 86.
84, Loop carried dependence due to exposed use of amlocal(:),aclocal(:),aplocal(:),xout(:),f(:) prevents parallelization
86, Loop carried dependence due to exposed use of amlocal(:),aclocal(:),aplocal(:),xout(:),f(:) prevents parallelization

These are because parallelization of these loops would require privatization of the arrays.

In looking at your source, my first test would be to collapse the outer 4 loops to use gang and then vectorize the inner “i” loops. The compiler will implicitly vectorize the array syntax, but you might put F’s and xout’s initialization into an explicit loop so they’re combined, thus removing a thread synchronization point. Something like:

!$acc parallel loop gang collapse(4) private(F,xout,APlocal,AMlocal,AClocal)
       do prim=1,nvars
        do nbl=1,nblocks
         do k=1,128 !NK(nbl)
          do j=1,128 !NJ(nbl)
!$acc loop vector
           do i=1, NIMax
                 F(i) = 0.0d0
                 xout(i) = 0.0d0
           enddo       
!$acc loop vector
           do i=3,NI(nbl)-2
.... etc ...

The second test would be what I suggested before, which is to collapse the outer three loops and vectorize the “j” loop. Though I’d highly suggest making “j” the stride-1 dimension for PHI and PHID, which requires changing the layout of the arrays a bit.

  real(dp),dimension(NJmax,NImax,NKmax,nblocks,nvars) :: PHI, PHID
... 
!$acc parallel loop gang collapse(3)
   do prim=1,nvars
    do nbl=1,nblocks
     do k=1,128 !NK(nbl)
!$acc loop vector private(F,xout,APlocal,AMlocal,AClocal)
      do j=1,128 !NJ(nbl)
...
       do i=3,NI(nbl)-2
         PHID(j,i,k,nbl,prim) = (b4*(PHI(j,i+2,k,nbl,prim) - &
                                     PHI(j,i-2,k,nbl,prim))) + &
                                (a2*(PHI(j,i+1,k,nbl,prim) - &
                                     PHI(j,i-1,k,nbl,prim)))
       enddo

The first suggestion will allow for the private arrays to be placed in shared memory, increase the amount of parallelism used, and not require you to change the data layout. Though it will add more thread synchronization between the multiple vector loops, and some sections (the boundary setting) will only be run by a single vector.

The second will use fewer total threads but give each vector (CUDA thread) more work and reduce synchronization. More memory will be used since now each thread will have its own copy of the private arrays, but hopefully the arrays will fit in L2 cache so it won’t hurt much.

Since I can’t run it, I have no idea which will be better performance-wise, but it’s easy to change the directives, so it’s simple to try both and compare (though the data-layout change is a bit more cumbersome).

Thank you Mat.
Tried your suggestions:
(1) With collapse(4), the compiler is automatically doing what you suggested (parallelizing the F and xout initializations and fusing them).
(2) collapse(3) with vector on the last (j) loop.

But the results (timings) are very similar.

Changing the index ordering in the 5-D arrays is yet to be tried, as it requires considerable changes.

Anyway, thanks for the hints.

arun

Good to know, though the key to the second version is to get the stride-1 dimension of the arrays to match the vector loop. It’s certainly understandable if you can’t make this change. Personally, I tend to opt for portability with good performance over intrusive changes that achieve optimal performance, since it’s better for long-term maintenance. So if option #1 gets you good performance, there’s no need to change the arrays.