Hi,
I am trying to parallelize only the inner loop, as follows:
program innerloop
  implicit none
  integer :: i, j, f(3), f_j(10)
  f = 0
  !$acc data region local(f_j), copy(f)
  do i = 1, 3
    !$acc region
    do j = 1, 10
      f_j(j) = i*j
    end do
    f(i) = sum(f_j)
    !$acc end region
  end do
  !$acc end data region
  write(*,*) f
end program
However, the output of “f” is “0 0 0”, while the correct result should be “55 110 165”.
Can anyone point out my mistake? The pgfortran version is 10.3.
Hi WENYANG LIU,
This looks like it may be a compiler issue rather than an error in your code. I have sent a report to our engineers (TPR#17286) for further investigation.
The workaround is to either remove the “copy(f)” clause from the data region directive or modify the code as follows:
% cat tmp2.f90
program innerloop
  implicit none
  integer :: i, j
  real :: f(3), f_j(10)
  f = 0
  !$acc region local(f_j)
  !$acc do host
  do i = 1, 3
    do j = 1, 10
      f_j(j) = i*j
    end do
    f(i) = sum(f_j)
  end do
  !$acc end region
  write(*,*) f
end program
% pgf90 -ta=nvidia -Minfo -V10.3 tmp2.f90 ; a.out
innerloop:
      9, Generating local(f_j(:))
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     11, Parallelization would require privatization of array 'f_j(1:10)'
         Sequential loop scheduled on host
     12, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(10)
     15, sum reduction inlined
         Loop is parallelizable
         Accelerator kernel generated
         15, !$acc do parallel, vector(10)
             Sum reduction generated for f_j$r
55.00000 110.0000 165.0000
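For completeness, here is a minimal sketch of the first workaround mentioned above, which keeps the original program and only drops the “copy(f)” clause from the data region directive. Everything else is unchanged from the original post. This is an illustration of the suggested change, not output captured from a 10.3 build; the intended result is still 55 110 165.

```fortran
program innerloop
  implicit none
  integer :: i, j, f(3), f_j(10)
  f = 0
  ! Workaround sketch: copy(f) is removed from the data region,
  ! so only f_j is kept as device-local data across iterations.
  !$acc data region local(f_j)
  do i = 1, 3
    !$acc region
    do j = 1, 10
      f_j(j) = i*j
    end do
    f(i) = sum(f_j)
    !$acc end region
  end do
  !$acc end data region
  write(*,*) f
end program innerloop
```

With “f” no longer listed on the data region, there is no copy-back of “f” at the end of the data region to replace the values stored inside the loop.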
Thanks,
Mat
Hi Mkcolg,
Thanks for your reply.
I have a question regarding the modified code you provided:
Since the “i” loop is scheduled on the host, why is a sum reduction generated for f_j at line 15?
Hi WENYANG LIU,
Only the outer i loop is scheduled on the host. The two inner loops (sum is really a loop) are accelerated with the compiler generating a kernel for each. Do you not want the sum reduction parallelized?
Hi WENYANG LIU,
TPR#17286 was fixed in the 11.6 release when we added support for scalar kernels. The problem here was that “f(i)=sum(f_j)” gets transformed into:
tmp = 0
do j = 1,10
tmp = tmp + f_j(j)
enddo
f(i) = tmp
While the do loop is scheduled on the device, the final update “f(i) = tmp” had to be performed on the host. However, since you copy back “f” at the end of the data region, the host values get overwritten.
Adding support for scalar kernels in 11.6 allows “f(i) = tmp” to be executed on the device.
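The overwrite described above can be mimicked in plain host-only Fortran. The sketch below is only a model of the behavior, not what the compiler generates: “f_host” and “f_dev” are hypothetical names standing in for the host and device copies of “f”, and the array assignments stand in for the data region's copy-in and copy-out.

```fortran
program overwrite_sketch
  implicit none
  integer :: i, j, tmp
  integer :: f_host(3), f_dev(3), f_j(10)

  f_host = 0
  f_dev = f_host            ! data region entry: copy(f) sends f to the device

  do i = 1, 3
    do j = 1, 10            ! a device kernel in the real program
      f_j(j) = i*j
    end do
    tmp = 0
    do j = 1, 10            ! the sum reduction, also a device kernel
      tmp = tmp + f_j(j)
    end do
    f_host(i) = tmp         ! before 11.6: this store happens on the host only
  end do

  write(*,*) 'host f before copy-out:', f_host
  f_host = f_dev            ! data region exit: copy(f) brings the device f back
  write(*,*) 'host f after copy-out: ', f_host
end program overwrite_sketch
```

In this model the host copy holds 55, 110, 165 just before the copy-out and 0, 0, 0 after it, because the device copy was never updated. Once “f(i) = tmp” runs on the device, the store would go to the device copy instead and the copy-out would return the correct values.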
Thanks,
Mat