I have the following 4-deep loop nest:
do i = 1, M
  do j = 1, N
    do k = 1, O
      do l = 1, P
        …
        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp
        …
      enddo
    enddo
  enddo
enddo
Here,
- M and N are both < 10
- O, P and Q are all 128
- temp needs to be private.
So I did this:
!$acc parallel copyout(out) private(temp)
!$acc loop seq
do i = 1, M
  !$acc loop seq
  do j = 1, N
    !$acc loop vector
    do k = 1, O
      !$acc loop
      do l = 1, P
        …
        temp(1:Q) = …
        out(1:Q, l, k, j, i) = temp
        …
      enddo
      !$acc end loop
    enddo
    !$acc end loop
  enddo
  !$acc end loop
enddo
!$acc end loop
!$acc end parallel
After some effort, I am getting correct results and some speedup, but
I have some basic questions:
(1) Why is the private clause on the parallel construct correct? It
only gives each gang its own copy of temp, doesn't it?
(2) If I put the private clause on the loop construct before the
k-loop instead, I get incorrect results.
Shouldn't this give the correct result, and (1) the incorrect one?
(3) I notice (with -Minfo=all) that the compiler adds a loop gang for
the l-loop:
    82, Generating Tesla code
        84, !$acc loop seq
        86, !$acc loop seq
        88, !$acc loop vector(128) ! threadidx%x
        90, !$acc loop gang ! blockidx%x
Okay, fine. But if I swap the two (lines 88 and 90), i.e.
- loop gang on the k-loop
- loop vector on the l-loop,
I get an incorrect answer. Why?
(4) Finally, what is the best way to arrange the loop constructs, given
the values of M, N, O and P?
The best I am getting is only a 3x speedup.
(5) Is the 5-D array “out” a concern for speedup?
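To make the variants in (2) and (3) concrete, here is a minimal sketch of the two directive placements. It is not the actual code: the sizes are shrunk, the loop body is a hypothetical stand-in for the elided one, and the names ref, out2 and out3 are made up for the comparison. The `!$acc end loop` lines are left out because, as far as I can tell, the loop directive does not take a closing directive.

```fortran
program acc_variants
  implicit none
  ! Hypothetical small sizes so the sketch runs quickly;
  ! in the real code M, N < 10 and O = P = Q = 128.
  integer, parameter :: M = 2, N = 3, O = 8, P = 8, Q = 8
  real :: ref(Q, P, O, N, M), out2(Q, P, O, N, M), out3(Q, P, O, N, M)
  real :: temp(Q)
  integer :: i, j, k, l

  ! Serial reference. The assignment to temp is a stand-in body,
  ! not the real computation.
  do i = 1, M
    do j = 1, N
      do k = 1, O
        do l = 1, P
          temp(1:Q) = real(l + P*(k + O*(j + N*i)))
          ref(1:Q, l, k, j, i) = temp
        enddo
      enddo
    enddo
  enddo

  ! Variant (2): private(temp) moved from the parallel construct
  ! to the loop directive on the k-loop.
  !$acc parallel copyout(out2)
  !$acc loop seq
  do i = 1, M
    !$acc loop seq
    do j = 1, N
      !$acc loop vector private(temp)
      do k = 1, O
        !$acc loop
        do l = 1, P
          temp(1:Q) = real(l + P*(k + O*(j + N*i)))
          out2(1:Q, l, k, j, i) = temp
        enddo
      enddo
    enddo
  enddo
  !$acc end parallel

  ! Variant (3): private(temp) back on the parallel construct,
  ! but gang on the k-loop and vector on the l-loop.
  !$acc parallel copyout(out3) private(temp)
  !$acc loop seq
  do i = 1, M
    !$acc loop seq
    do j = 1, N
      !$acc loop gang
      do k = 1, O
        !$acc loop vector
        do l = 1, P
          temp(1:Q) = real(l + P*(k + O*(j + N*i)))
          out3(1:Q, l, k, j, i) = temp
        enddo
      enddo
    enddo
  enddo
  !$acc end parallel

  ! Compare each variant against the serial reference.
  print *, 'variant (2) matches serial reference: ', all(out2 == ref)
  print *, 'variant (3) matches serial reference: ', all(out3 == ref)
end program acc_variants
```

Built without OpenACC, the directives are plain comments and both checks should report true; the interesting case is what they report when the directives are honored.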
I can send the actual code (only the innermost body is big) and the
pgfortran output if the behavior in (1)-(3) is not expected for this
code snippet.
Thanks,
arun