Should "acc reduction" be used in an inner loop?

Hi All,

I am a new user of the PGI Accelerator. I am trying to use OpenACC for a matrix multiplication, such as:

``````!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA
``````

The compiler reports:

``````331, Complex loop carried dependence of 'm' prevents parallelization
     Loop carried reuse of 'm' prevents parallelization
``````

Does this mean that the PGI compiler has treated m(i,j,k) as a reduction in the inner loop "do l=1,n"? Is it NOT necessary to write:

``````      tmp = 0.0
!$ACC LOOP REDUCTION(+:tmp)
      do l=1,n
        tmp = tmp + D(l,i)*u(l,j,k)
     $            + D(l,j)*v(i,l,k)
     $            + D(l,k)*w(i,j,l)
      enddo
      m(i,j,k) = tmp
``````

Thank you very much for your help

Hi jigo3635,

The "unable to parallelize" message applies to the "l" loop, since the same elements of "m" are updated on every iteration of that loop. So, yes, you would need to use the reduction clause with a scalar to get this loop to accelerate.

However, since you have three levels of outer loops, you may be better off scheduling only those loops and running the reduction loop sequentially. As you have it now, the "k" loop would be scheduled as the "gang" and "l" would be your "vector"; "j" and "i" would run sequentially within a "gang". Since you can have two dimensions in the "gang", you could collapse "k" and "j" together, but "i" would still be sequential. So not only have you reduced the amount of parallelism, there is also the additional overhead of setting up a reduction.

What I'd do is experiment with the schedule, or use the "kernels" construct instead of "parallel" and let the compiler figure out the best schedule.

Some ideas:

``````!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC KERNELS
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END KERNELS
!$ACC END DATA
``````

``````!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP COLLAPSE(2)
      do k=1,n
        do j=1,n
          do i=1,n
            tmp = 0.
!$ACC LOOP REDUCTION(+:tmp)
            do l=1,n
              tmp = tmp + D(l,i)*u(l,j,k)
     $                  + D(l,j)*v(i,l,k)
     $                  + D(l,k)*w(i,j,l)
            enddo
            m(i,j,k) = tmp
          enddo
        enddo
      enddo
!$ACC END DATA
``````

``````!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL
!$ACC END DATA
``````

Hope this helps,
Mat

Hi Mat,

Thank you very much for your responses.

Perhaps there is a typo in version 3 of the code.

``````...
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
        do j=1,n
!$ACC LOOP VECTOR
``````

"!$ACC LOOP GANG VECTOR" should be written "!$ACC LOOP WORKER" with the PGI compiler; otherwise the code cannot be compiled.

After rewriting it that way, it seems that this code works with the Cray compiler but NOT the PGI compiler on a Cray XE6 machine. My test code is below.

``````      program test_axf

c      implicit none
      integer, parameter :: n=10
      integer :: i, j, k, l
      real :: us,ur,ut
      real, dimension(:,:), allocatable :: D
      real, dimension(:,:,:), allocatable :: w, u

      real :: summ
      allocate (D(n,n))
      allocate (w(n,n,n), u(n,n,n))

      D = 0.0
      w = 0.0
      u = 0.0

      do j = 1,n
        do i = 1,n
          D(i,j) = 1.0
        enddo
      enddo

      do k = 1,n
        do j = 1,n
          do i = 1,n
            u(i,j,k) = 1.0
          enddo
        enddo
      enddo

!$ACC DATA COPYIN(D)
!$ACC& COPY(w,u)
      call ax3f(w,u,ur,us,ut,n,D)
!$ACC WAIT
!$ACC END DATA

      summ = 0.0
      do k = 1,n
        do j = 1,n
          do i = 1,n
            summ = summ + w(i,j,k)
          enddo
        enddo
      enddo

      write(*,*) "SUMMM= ", summ

      deallocate (D,w,u)
      contains

c-----------------------------------------------------------------------
      subroutine ax3f(w,u,ur,us,ut,n,D)
      real w(n,n,n), u(n,n,n), D(n,n)
      real ur,us,ut,wtmp
      integer i,j,k,l,e

!$ACC DATA PRESENT(u)
!$ACC& PRESENT(w)
!$ACC& PRESENT(D)
!$ACC PARALLEL
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP WORKER
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
            w(i,j,k) = 0.
            do l=1,n
              w(i,j,k) = w(i,j,k) + D(i,l)*u(l,j,k)
     $                            + D(i,l)*u(i,l,k)
     $                            + D(i,l)*u(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$acc end parallel
!$ACC end data

      return
      end subroutine ax3f
      end program test_axf
``````

I just wonder whether there are differences between the PGI and Cray compilers in their OpenACC syntax support.

Thanks again.

Regards, Jin

Perhaps there is a typo in version 3 of the code.

Yes, I was mixing kernels-style loop scheduling within a parallel construct. I should have used "kernels", or the following schedule where the outer loops are collapsed.

``````!$ACC PARALLEL
!$ACC LOOP gang collapse(2)
      do k=1,n
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
            w(i,j,k) = 0.
``````

or

``````!$ACC kernels
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP gang vector
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
``````

The "worker" schedule on NVIDIA corresponds to a warp (a group of 32 threads) and is not configurable by the user. I think Cray has a different interpretation of what a worker is. We and the other OpenACC members are trying to work out these implementation differences, but in the meantime, try one of the above schedules and see if Cray matches ours.

Personally, I’d use the “kernels” construct without any loop schedules and let the compiler determine the best schedule.

``````!$ACC kernels
      do k=1,n
        do j=1,n
          do i=1,n
``````
- Mat

Hi Mat,

Now it works fine with the PGI compiler.

Thank you very much for your help.

/Jin