Should I use "acc reduction" in an inner loop?

gongjing · November 22, 2012, 11:04am

Hi All,

I am a new user of the PGI Accelerator compiler. I am trying to use OpenACC for a matrix multiplication, such as

…
!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA

When I compile the code, I obtain the following message:

331, Complex loop carried dependence of ‘m’ prevents parallelization
Loop carried reuse of ‘m’ prevents parallelization

Does this mean that the PGI compiler has already treated m(i,j,k) as a reduction in the inner loop “do l=1,n”? Is it NOT necessary to do:

…
      tmp = 0.0
!$ACC LOOP REDUCTION(+:tmp)
      do l=1,n
        tmp = tmp + D(l,i)*u(l,j,k)
     $            + D(l,j)*v(i,l,k)
     $            + D(l,k)*w(i,j,l)
      enddo
      m(i,j,k) = tmp
…

Thank you very much for your help

MatColgrove · November 26, 2012, 5:41pm

Hi jigo3635,

The message about being unable to parallelize applies to the “l” loop, since the same element of “m” is updated on every iteration of that loop. So, yes, you would need to use the reduction clause with a scalar to get this loop to accelerate.

However, since you have three levels of outer loops, you may be better off scheduling only these loops and having the reduction loop performed sequentially. As you have it now, the “k” loop would be scheduled as the “gang” and “l” would be your “vector”. “j” and “i” are run sequentially within a “gang”. Since you can have two dimensions in the “gang”, you could collapse “k” and “j” together, but “i” would still be sequential. So not only have you reduced the amount of parallelism, there is also the additional overhead of setting up a reduction.

What I’d do is experiment with the schedule, or use the “kernels” construct instead of “parallel” and let the compiler figure out the best schedule.

Some ideas:

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC KERNELS
      do k=1,n
        do j=1,n
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END KERNELS
!$ACC END DATA

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP COLLAPSE(2)
      do k=1,n
        do j=1,n
          do i=1,n
            tmp = 0.
!$ACC LOOP REDUCTION(+:tmp)
            do l=1,n
              tmp = tmp + D(l,i)*u(l,j,k)
     $                  + D(l,j)*v(i,l,k)
     $                  + D(l,k)*w(i,j,l)
            enddo
            m(i,j,k) = tmp
          enddo
        enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
        do j=1,n
!$ACC LOOP VECTOR
          do i=1,n
            m(i,j,k) = 0.
            do l=1,n
              m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                            + D(l,j)*v(i,l,k)
     $                            + D(l,k)*w(i,j,l)
            enddo
          enddo
        enddo
      enddo
!$ACC END PARALLEL
!$ACC END DATA
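
The point above about scheduling only the three outer loops and running the reduction loop sequentially could be sketched roughly as follows. This is a minimal free-form sketch under stated assumptions, not a tuned or compiler-verified schedule: the program name, the problem size n = 16, the all-ones test data, and the explicit private(tmp) and loop seq clauses are all added here for illustration. When built without OpenACC enabled, the directives are treated as comments and the loops simply run on the host.

```fortran
program collapse_seq_sketch
  implicit none
  integer, parameter :: n = 16          ! assumed problem size
  integer :: i, j, k, l
  real :: tmp
  real, allocatable :: D(:,:), u(:,:,:), v(:,:,:), w(:,:,:), m(:,:,:)

  allocate (D(n,n), u(n,n,n), v(n,n,n), w(n,n,n), m(n,n,n))
  ! All-ones test data (an assumption), so every m(i,j,k) should be 3*n.
  D = 1.0
  u = 1.0
  v = 1.0
  w = 1.0

  !$acc data copyin(D,u,v,w) copyout(m)
  ! Schedule only the three tightly nested outer loops across the device.
  !$acc parallel loop collapse(3) private(tmp)
  do k = 1, n
    do j = 1, n
      do i = 1, n
        tmp = 0.0
        ! Each thread accumulates its own sum sequentially, so no
        ! reduction clause is needed on this loop.
        !$acc loop seq
        do l = 1, n
          tmp = tmp + D(l,i)*u(l,j,k) &
                    + D(l,j)*v(i,l,k) &
                    + D(l,k)*w(i,j,l)
        enddo
        m(i,j,k) = tmp
      enddo
    enddo
  enddo
  !$acc end parallel loop
  !$acc end data

  ! Host-side check against the value expected for all-ones input.
  if (all(abs(m - 3.0*real(n)) < 1.0e-4)) then
    write(*,*) 'OK'
  else
    write(*,*) 'MISMATCH'
  endif

  deallocate (D, u, v, w, m)
end program collapse_seq_sketch
```

Whether this is faster than the reduction version depends on n and on the target device, so it is worth timing both.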

Hope this helps,
Mat

gongjing · December 4, 2012, 4:40pm

Hi Mat,

Thank you very much for your responses.

Perhaps there is a typo in version 3 of the code.

...
!$ACC LOOP GANG
do k=1,n
!$ACC LOOP GANG VECTOR
do j=1,n
!$ACC LOOP VECTOR

“!$ACC LOOP GANG VECTOR” should be written “!$ACC LOOP WORKER” with the PGI compiler; otherwise the code cannot be compiled.

Even after I rewrote it this way, the code seems to work with the Cray compiler but NOT with the PGI compiler on a Cray XE6 machine. My test code is below.

      program test_axf

c      implicit none
      integer, parameter :: n=10
      integer :: i, j, k, l
      real :: us,ur,ut
      real, dimension(:,:),allocatable :: D
      real, dimension (:,:,:), allocatable :: w, u

      real :: summ
      allocate (D(n,n))
      allocate (w(n,n,n), u(n,n,n) )

      D = 0.0
      w = 0.0
      u = 0.0

      do j = 1,n
         do i = 1,n
            D(i,j) = 1.0
         enddo
      enddo

      do k = 1,n
         do j = 1,n
            do i = 1,n
               u(i,j,k) = 1.0
            enddo
         enddo
      enddo

!$ACC DATA COPYIN(D)
!$ACC& COPY(w,u)
      call ax3f(w,u,ur,us,ut,n,D)
!$ACC WAIT
!$ACC END DATA

      summ = 0.0
      do k = 1,n
         do j = 1,n
            do i = 1,n
               summ = summ + w(i,j,k)
            enddo
         enddo
      enddo

      write(*,*) "SUMMM= ", summ

      deallocate (D,w,u)
      contains

c-----------------------------------------------------------------------
      subroutine ax3f(w,u,ur,us,ut,n,D)
      real w (n,n,n), u (n,n,n), D(n,n)
      real ur,us,ut,wtmp
      integer i,j,k,l,e

!$ACC DATA PRESENT(u)
!$ACC& PRESENT(w)
!$ACC& PRESENT(D)
!$ACC PARALLEL
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP WORKER
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.
               do l=1,n
                  w(i,j,k) = w(i,j,k) + D(i,l)*u(l,j,k)
     $                 + D(i,l)*u(i,l,k)
     $                 + D(i,l)*u(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$acc end parallel
!$ACC end data

      return
      end subroutine ax3f
      end program test_axf

I just wonder whether there are any differences between the PGI and Cray compilers in how they handle OpenACC syntax.

Thanks again.

Regards, Jin

MatColgrove · December 4, 2012, 5:56pm

Perhaps there is a typo in version 3 of the code.

Yes, I was mixing Kernels loop scheduling within a parallel construct. I should have used “kernels” or the following schedule where the outer loops are collapsed.

!$ACC PARALLEL
!$ACC LOOP gang collapse(2)
      do k=1,n
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.

or

!$ACC kernels
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP gang vector
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n

The “Worker” schedule on NVIDIA corresponds to a warp (a group of 32 threads) and is not configurable by the user. I think Cray has a different interpretation of what a Worker is. We and the other OpenACC members are trying to work out these implementation differences, but in the meantime, try one of the above schedules and see if Cray matches ours.

Personally, I’d use the “kernels” construct without any loop schedules and let the compiler determine the best schedule.

!$ACC kernels
      do k=1,n
         do j=1,n
            do i=1,n
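
Combined with the test case posted earlier in the thread, this suggestion might look roughly like the sketch below. It is a free-form illustration under stated assumptions rather than the exact code that was run: the program name, the trimmed argument list of ax3f, the scalar accumulator wtmp, and the host-side check are added here. The check relies on the all-ones input, for which each w(i,j,k) is 3*n and the total is 3*n**4. Without OpenACC enabled, the directives are comments and the code runs on the host.

```fortran
program kernels_sketch
  implicit none
  integer, parameter :: n = 10
  real, allocatable :: D(:,:), w(:,:,:), u(:,:,:)

  allocate (D(n,n), w(n,n,n), u(n,n,n))
  D = 1.0
  u = 1.0
  w = 0.0

  ! Data region in the caller; the subroutine only asserts presence.
  !$acc data copyin(D,u) copyout(w)
  call ax3f(w, u, n, D)
  !$acc end data

  ! With all-ones input each w(i,j,k) is 3*n, so the sum is 3*n**4.
  if (abs(sum(w) - 3.0*real(n)**4) < 1.0e-2) then
    write(*,*) 'OK, sum =', sum(w)
  else
    write(*,*) 'MISMATCH, sum =', sum(w)
  endif

  deallocate (D, w, u)

contains

  subroutine ax3f(w, u, n, D)
    integer, intent(in) :: n
    real, intent(out) :: w(n,n,n)
    real, intent(in) :: u(n,n,n), D(n,n)
    integer :: i, j, k, l
    real :: wtmp

    ! No gang/worker/vector clauses: the compiler chooses the schedule.
    !$acc kernels present(w,u,D)
    do k = 1, n
      do j = 1, n
        do i = 1, n
          ! Accumulate in a scalar instead of updating w(i,j,k) in place.
          wtmp = 0.0
          do l = 1, n
            wtmp = wtmp + D(i,l)*u(l,j,k) &
                        + D(i,l)*u(i,l,k) &
                        + D(i,l)*u(i,j,l)
          enddo
          w(i,j,k) = wtmp
        enddo
      enddo
    enddo
    !$acc end kernels
  end subroutine ax3f

end program kernels_sketch
```

The compiler feedback output (for example, -Minfo=accel with the PGI compilers) shows which schedule was actually chosen for each loop.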

Mat

gongjing · December 6, 2012, 10:20am

Hi Mat,

Now it works fine with the PGI compiler.

Thank you very much for your help.

/Jin
