Should I use "acc reduction" in an inner loop?

Hi All,

I am a new user of the PGI Accelerator compiler. I am trying to use OpenACC for a matrix multiplication, such as:


!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA

When I compile the code, I get the following message:

331, Complex loop carried dependence of ‘m’ prevents parallelization
Loop carried reuse of ‘m’ prevents parallelization

Does this mean that the PGI compiler already handles m(i,j,k) as a reduction in the inner loop “do l=1,n”? Is it then NOT necessary to write the following:


      tmp = 0.0
!$ACC LOOP REDUCTION(+:tmp)
      do l=1,n
        tmp = tmp + D(l,i)*u(l,j,k)
     $            + D(l,j)*v(i,l,k)
     $            + D(l,k)*w(i,j,l)
      enddo
      m(i,j,k) = tmp

Thank you very much for your help

Hi jigo3635,

The “unable to parallelize” message applies to the “l” loop, since the same element of “m” is updated on every iteration of that loop. So, yes, you would need to use the reduction clause with a scalar to get this loop to accelerate.

However, since you have three levels of outer loops, you may be better off scheduling only these loops and having the reduction loop run sequentially. As you have it now, the “k” loop would be scheduled as the “gang” and “l” would be your “vector”, while “j” and “i” are run sequentially within a “gang”. Since you can have two dimensions in the “gang”, you could collapse “k” and “j” together, but “i” would still be sequential. So not only have you reduced the amount of parallelism, you also pay the additional overhead of setting up a reduction.
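A minimal sketch of that alternative, in which only the three outer loops are scheduled and the “l” loop runs sequentially on a private scalar. The program name collapse_sketch, the size n=8, the all-ones inputs, and the COLLAPSE(3)/PRIVATE(tmp)/LOOP SEQ clauses are choices made for illustration; whether this schedule beats the others depends on the compiler and the device.

```fortran
      program collapse_sketch
      implicit none
      integer, parameter :: n = 8
      integer :: i, j, k, l
      real :: tmp
      real :: D(n,n), u(n,n,n), v(n,n,n), w(n,n,n), m(n,n,n)

c     Illustrative inputs: with all ones, every m(i,j,k) equals 3*n.
      D = 1.0
      u = 1.0
      v = 1.0
      w = 1.0

!$ACC DATA COPYIN(D,u,v,w) COPYOUT(m)
c     Only the k, j and i loops are parallelized (collapsed into one).
!$ACC PARALLEL LOOP COLLAPSE(3) PRIVATE(tmp)
      do k=1,n
        do j=1,n
          do i=1,n
            tmp = 0.
c           The l loop runs sequentially, so no reduction is set up.
!$ACC LOOP SEQ
            do l=1,n
              tmp = tmp + D(l,i)*u(l,j,k)
     $                  + D(l,j)*v(i,l,k)
     $                  + D(l,k)*w(i,j,l)
            enddo
            m(i,j,k) = tmp
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA

c     Largest deviation from the expected value 3*n.
      write(*,*) 'max error = ', maxval(abs(m - 3.0*n))
      end program collapse_sketch
```

Without OpenACC enabled the directives are plain comments, so the same file can be built for the host to check the result before comparing timings.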

What I’d do is experiment with the schedule, or use the “kernels” construct instead of “parallel” and let the compiler figure out the best schedule.

Some ideas:

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC KERNELS
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END KERNELS
!$ACC END DATA



!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP COLLAPSE(2)
      do k=1,n
        do j=1,n
          do i=1,n
            tmp = 0.
!$ACC LOOP REDUCTION(+:tmp)
            do l=1,n
              tmp = tmp + D(l,i)*u(l,j,k)
     $                  + D(l,j)*v(i,l,k)
     $                  + D(l,k)*w(i,j,l)
            enddo
            m(i,j,k) = tmp
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA



!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL
!$ACC END DATA

Hope this helps,
Mat

Hi Mat,

Thank you very much for your responses.

Perhaps there is a typo in version 3 of the code.

...
!$ACC LOOP GANG
do k=1,n
!$ACC LOOP GANG VECTOR
do j=1,n
!$ACC LOOP VECTOR

With the PGI compiler, “!$ACC LOOP GANG VECTOR” has to be written as “!$ACC LOOP WORKER”; otherwise the code does not compile.

Even after I rewrote this, the code seems to work with the Cray compiler but NOT with the PGI compiler on a Cray XE6 machine. My test code is below.

      program test_axf

c      implicit none
      integer, parameter :: n=10
      integer :: i, j, k, l
      real :: us,ur,ut
      real, dimension(:,:),allocatable :: D
      real, dimension (:,:,:), allocatable :: w, u

      real :: summ
      allocate (D(n,n))
      allocate (w(n,n,n), u(n,n,n) )

      D = 0.0
      w = 0.0
      u = 0.0

      do j = 1,n
         do i = 1,n
            D(i,j) = 1.0
         enddo
      enddo

      do k = 1,n
         do j = 1,n
            do i = 1,n
               u(i,j,k) = 1.0
            enddo
         enddo
      enddo

!$ACC DATA COPYIN(g,D)
!$ACC& COPY(w,u)
      call ax3f(w,u,ur,us,ut,n,D)
!$ACC WAIT
!$ACC END DATA

      summ = 0.0
      do k = 1,n
         do j = 1,n
            do i = 1,n
               summ = summ + w(i,j,k)
            enddo
         enddo
      enddo

      write(*,*) "SUMMM= ", summ

      deallocate (D,w,u)
      contains

c-----------------------------------------------------------------------
      subroutine ax3f(w,u,ur,us,ut,n,D)
      real w (n,n,n), u (n,n,n), D(n,n)
      real ur,us,ut,wtmp
      integer i,j,k,l,e

!$ACC DATA PRESENT(u)
!$ACC& PRESENT(w)
!$ACC& PRESENT(g,D)
!$ACC  PARALLEL
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP WORKER
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.
               do l=1,n
                  w(i,j,k) = w(i,j,k) + D(i,l)*u(l,j,k)
     $                 + D(i,l)*u(i,l,k)
     $                 + D(i,l)*u(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$acc end parallel
!$ACC end data

      return
      end
      end

I just wonder whether the PGI and Cray compilers differ in how they handle OpenACC syntax.

Thanks again.

Regards, Jin

“Perhaps there is a typo in version 3 of the code.”

Yes, I was mixing “kernels” loop scheduling into a “parallel” construct. I should have used “kernels”, or the following schedule in which the outer loops are collapsed.

!$ACC PARALLEL
!$ACC LOOP gang collapse(2)
      do k=1,n
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.

or

!$ACC kernels
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP gang vector
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n

The “Worker” schedule on NVIDIA corresponds to a warp (a group of 32 threads) and is not configurable by the user. I think Cray has a different interpretation of what a Worker is. We and the other OpenACC members are trying to work out these implementation differences, but in the meantime, try one of the above schedules and see if Cray matches ours.

Personally, I’d use the “kernels” construct without any loop schedules and let the compiler determine the best schedule.

!$ACC kernels
      do k=1,n
         do j=1,n
            do i=1,n
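
Spelled out as a complete program, that approach might look like the sketch below. It reuses the loop body and the all-ones inputs from the test program above; the program name kernels_sketch, the static arrays, and the data clauses placed directly on the kernels directive are assumptions made to keep the example short.

```fortran
      program kernels_sketch
      implicit none
      integer, parameter :: n = 10
      integer :: i, j, k, l
      real :: D(n,n), u(n,n,n), w(n,n,n)

c     Same initial values as the test program above.
      D = 1.0
      u = 1.0

c     No gang/worker/vector clauses: the compiler picks the schedule.
!$ACC KERNELS COPYIN(D,u) COPYOUT(w)
      do k=1,n
        do j=1,n
          do i=1,n
            w(i,j,k) = 0.
            do l=1,n
              w(i,j,k) = w(i,j,k) + D(i,l)*u(l,j,k)
     $                 + D(i,l)*u(i,l,k)
     $                 + D(i,l)*u(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END KERNELS

c     Each w(i,j,k) is 3*n = 30, so the sum should be 3*n**4 = 30000.
      write(*,*) 'sum = ', sum(w)
      end program kernels_sketch
```

The compiler's feedback messages should then report which schedule it chose for each loop, which is a reasonable starting point before tuning anything by hand.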
- Mat

Hi Mat,

Now it works fine with the PGI compiler.

Thank you very much for your help.

/Jin