parallelize inner loop

Hi,
I am trying to parallelize only the inner loop, as follows:

program innerloop

implicit none 
integer::i,j,f(3),f_j(10)

 
f=0


!$acc data region local(f_j),copy(f)

do i=1,3
!$acc region 
   do j=1,10 
      f_j(j)=i*j
   end do 

   f(i)=sum(f_j)
!$acc end region     
end do 


!$acc end data region


write(*,*)f

end program

However, the output of “f” is “0 0 0”, while the correct output should be “55 110 165”.
Can anyone point out my mistake? I am using pgfortran version 10.3.

Hi WENYANG LIU,

This looks like it may be more of a compiler issue. I have sent a report to our engineers (TPR#17286) for further investigation.

The workaround is to either remove the “copy(f)” clause from the data region directive or modify the code as follows:

% cat tmp2.f90 
program innerloop

implicit none
integer::i,j
real :: f(3),f_j(10)
 
f=0

!$acc region local(f_j)
!$acc do host
do i=1,3
   do j=1,10
      f_j(j)=i*j
   end do
   f(i)=sum(f_j)
end do
!$acc end region     

write(*,*)f

end program 
% pgf90 -ta=nvidia -Minfo -V10.3 tmp2.f90 ; a.out
innerloop:
      9, Generating local(f_j(:))
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     11, Parallelization would require privatization of array 'f_j(1:10)'
         Sequential loop scheduled on host
     12, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(10)
     15, sum reduction inlined
         Loop is parallelizable
         Accelerator kernel generated
         15, !$acc do parallel, vector(10)
             Sum reduction generated for f_j$r
    55.00000        110.0000        165.0000
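
For completeness, the first workaround (dropping “copy(f)” from the data region directive) might look like the sketch below. It is the original program with only that clause removed. It has not been checked against the 10.3 compiler here, so treat it as an illustration rather than verified output.

```fortran
program innerloop
   implicit none
   integer :: i, j, f(3), f_j(10)

   f = 0

   ! Assumption: identical to the original post except that copy(f) is gone,
   ! so f is no longer copied back from the device when the data region ends.
!$acc data region local(f_j)
   do i = 1, 3
!$acc region
      do j = 1, 10
         f_j(j) = i*j
      end do
      f(i) = sum(f_j)
!$acc end region
   end do
!$acc end data region

   write(*,*) f
end program innerloop
```

Compiled without accelerator support, the directives are plain comments and the program writes the values 55, 110 and 165.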

Thanks,
Mat

Hi Mkcolg,

Thanks for your reply.
I have a question regarding the modified code you provided:
Since the “i” loop is on the host, why is a sum reduction generated for f_j on line 15?

Hi WENYANG LIU,

Only the outer i loop is scheduled on the host. The two inner loops (sum is really a loop) are accelerated, with the compiler generating a kernel for each. Do you not want the sum reduction parallelized?

- Mat

Hi WENYANG LIU,

TPR#17286 was fixed in the 11.6 release when we added support for scalar kernels. The problem here was that “f(i)=sum(f_j)” gets transformed into:

tmp = 0
do j = 1,10
   tmp = tmp + f_j(j)
enddo
f(i) = tmp

While the do loop is scheduled on the device, the final update “f(i) = tmp” had to be performed on the host. However, since you copy back “f” at the end of the data region, the host values get overwritten by the device copy of “f”, which was never updated and still holds the initial zeros.

Adding support for scalar kernels in 11.6 allows “f(i) = tmp” to be executed on the device.
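
The overwrite can be mimicked in plain Fortran with no directives at all. The sketch below is only a host-side simulation of the behavior described above, not what the compiler actually generates. The names f_host and f_dev are hypothetical stand-ins for the host and device copies of “f”.

```fortran
program copyback_sketch
   implicit none
   integer :: i, j, tmp
   integer :: f_host(3), f_dev(3), f_j(10)   ! f_host/f_dev: hypothetical names

   f_host = 0
   f_dev = f_host            ! entering the data region: copy(f) sends f to the device

   do i = 1, 3
      tmp = 0
      do j = 1, 10           ! stands in for the two accelerated inner loops
         f_j(j) = i*j
         tmp = tmp + f_j(j)
      end do
      f_host(i) = tmp        ! before 11.6: the final store ran on the host only
   end do

   write(*,*) 'host before copy-back:', f_host

   f_host = f_dev            ! leaving the data region: copy(f) overwrites the host values
   write(*,*) 'host after copy-back: ', f_host
end program copyback_sketch
```

The first write shows 55, 110 and 165. The second shows three zeros, matching the output in the original post. With the store executed on the device instead (the 11.6 behavior), f_dev would hold the sums and the copy-back would return them.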

Thanks,
Mat