parallelize inner loop

WENYANG_LIU · October 14, 2010, 12:48pm

Hi,
I am trying to parallelize only the inner loop, as follows:

program innerloop

implicit none 
integer::i,j,f(3),f_j(10)

 
f=0


!$acc data region local(f_j),copy(f)

do i=1,3
!$acc region 
   do j=1,10 
      f_j(j)=i*j
   end do 

   f(i)=sum(f_j)
!$acc end region     
end do 


!$acc end data region


write(*,*)f

end program

However, the output of “f” is “0 0 0”, while the correct output should be “55 110 165”.
Can anyone point out my mistake? The pgfortran version is 10.3.

MatColgrove · October 14, 2010, 11:05pm

Hi WENYANG LIU,

This looks like it may be more of a compiler issue. I have sent a report to our engineers (TPR#17286) for further investigation.

The workaround is to either remove the “copy(f)” clause from the data region directive or modify the code as follows:

% cat tmp2.f90 
program innerloop

implicit none
integer::i,j
real :: f(3),f_j(10)
 
f=0

!$acc region local(f_j)
!$acc do host
do i=1,3
   do j=1,10
      f_j(j)=i*j
   end do
   f(i)=sum(f_j)
end do
!$acc end region     

write(*,*)f

end program 
% pgf90 -ta=nvidia -Minfo -V10.3 tmp2.f90 ; a.out
innerloop:
      9, Generating local(f_j(:))
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     11, Parallelization would require privatization of array 'f_j(1:10)'
         Sequential loop scheduled on host
     12, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(10)
     15, sum reduction inlined
         Loop is parallelizable
         Accelerator kernel generated
         15, !$acc do parallel, vector(10)
             Sum reduction generated for f_j$r
    55.00000        110.0000        165.0000

Thanks,
Mat

WENYANG_LIU · October 15, 2010, 1:42am

Hi Mkcolg,

Thanks for your reply.
I have a question regarding the modified code you provided:
Since the “i” loop is on the host, why is a sum reduction generated for f_j on line 15?

MatColgrove · October 15, 2010, 9:09pm

Hi WENYANG LIU,

Only the outer i loop is scheduled on the host. The two inner loops (sum is really a loop) are accelerated with the compiler generating a kernel for each. Do you not want the sum reduction parallelized?
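To illustrate the point that sum is really a loop, here is a minimal sketch in plain Fortran with no accelerator directives. The module name sum_as_loop and the helpers sum_by_loop and inner_loop_results are hypothetical names chosen for this illustration; they are not part of the PGI compiler or of the original program. The sketch only shows that each iteration of the outer loop contains two inner loops, which is why two kernels are reported.

```fortran
module sum_as_loop
  implicit none
contains

  ! Hypothetical helper: sum(a) written out as an explicit loop.
  function sum_by_loop(a) result(total)
    integer, intent(in) :: a(:)
    integer :: total, j
    total = 0
    do j = 1, size(a)
      total = total + a(j)
    end do
  end function sum_by_loop

  ! Same computation as the program in this thread, with both inner loops visible.
  function inner_loop_results() result(f)
    integer :: f(3), f_j(10), i, j
    do i = 1, 3                  ! outer loop: the one scheduled on the host
      do j = 1, 10               ! inner loop 1: fills f_j
        f_j(j) = i*j
      end do
      f(i) = sum_by_loop(f_j)    ! inner loop 2: the sum reduction
    end do
  end function inner_loop_results

end module sum_as_loop
```

Calling inner_loop_results() from a small driver program returns 55, 110 and 165, matching the expected values from the first post.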

Mat

MatColgrove · July 14, 2011, 6:28pm

Hi WENYANG LIU,

TPR#17286 was fixed in the 11.6 release when we added support for scalar kernels. The problem here was that “f(i)=sum(f_j)” gets transformed into:

tmp = 0
do j = 1,10
   tmp = tmp + f_j(j)
enddo
f(i) = tmp

While the do loop is scheduled on the device, the final update “f(i) = tmp” had to be performed on the host. However, since you copy back “f” at the end of the data region, the host values get overwritten by the device copy of “f”, which was never updated and still holds the initial zeros.

Adding support for scalar kernels in 11.6 allows for “f(i) = tmp” to be executed on the device.
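The behaviour described above can be modelled on the host in plain Fortran. This is only a sketch under stated assumptions: f_dev is a hypothetical stand-in for the device copy of “f”, the scalar_kernels flag is an invented switch between the 10.3 and 11.6 behaviour as described in this post, and nothing here runs on an accelerator or reflects actual compiler internals.

```fortran
module copy_back_model
  implicit none
contains

  ! Host-only model of the data region. f_dev is a hypothetical stand-in
  ! for the device copy of f; scalar_kernels is an invented switch.
  function run_model(scalar_kernels) result(f)
    logical, intent(in) :: scalar_kernels
    integer :: f(3), f_dev(3), f_j(10), i, j, tmp

    f = 0
    f_dev = f                  ! data region entry: copy(f) sends host f to the device

    do i = 1, 3
      do j = 1, 10
        f_j(j) = i*j
      end do
      tmp = 0
      do j = 1, 10             ! the expanded sum(f_j)
        tmp = tmp + f_j(j)
      end do
      if (scalar_kernels) then
        f_dev(i) = tmp         ! 11.6 as described: update runs on the device
      else
        f(i) = tmp             ! 10.3 as described: update runs on the host only
      end if
    end do

    f = f_dev                  ! data region exit: copy(f) overwrites host f
  end function run_model

end module copy_back_model
```

In this model, run_model(.false.) returns 0, 0, 0 because the host-side updates are overwritten at the end, while run_model(.true.) returns 55, 110, 165.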

Thanks,
Mat
