Hi everyone,
I tried pgfortran v10.3 and v10.9 to compile my code.
The compilation message always contains:
Loop is parallelizable
But the corresponding loops are not actually parallelized.
Is this a problem with my code or the compiler?
Sorry, I’ll need more information. What other informational messages are printed? Can you post an example?
Hi Mat,
Here is an example:
program para
  implicit none
  integer::f(10),f_j(10),fsum(10)
  integer::i,j
  fsum=0
  !$acc data region local(f),copy(fsum)
  !$acc region
  !$acc do private(f_j)
  do i=1,10
    f(i)=i
    do j=1,10
      f_j(j)=f(i)+10
    end do
    fsum(i)=sum(f_j)
  end do
  !$acc end region
  !$acc end data region
  write(*,*)fsum
end program
para:
      8, Generating local(f(:))
         Generating copy(fsum(:))
      9, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(10)
             CC 1.0 : 14 registers; 20 shared, 32 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 14 registers; 20 shared, 32 constant, 0 local memory bytes; 25 occupancy
     13, Loop is parallelizable
     16, sum reduction inlined
         Loop is parallelizable
I used “do private” on the i loop because the compiler reported that parallelizing it requires privatization of array ‘f_j(1:10)’.
Thanks.
Hi WENYANG LIU,
In this case, while the “j” loop and the sum reduction are parallelizable, the compiler has determined that the optimal schedule is to create a kernel only for the “i” loop, with the body of that loop becoming the body of the kernel. The “j” loop and the sum then run sequentially within each thread. The only alternative schedule would be to break the loop into three kernels, in which case you’ll need to rewrite the code a bit:
$ cat test.f90
program para
  implicit none
  integer::f(10),f_j(10,10),fsum(10)
  integer::i,j
  fsum=0
  !$acc data region local(f),copy(fsum)
  !$acc region
  do i=1,10
    f(i)=i
  enddo
  do i=1,10
    do j=1,10
      f_j(i,j)=f(i)+10
    end do
  enddo
  do i=1,10
    fsum(i)=sum(f_j(i,:))
  end do
  !$acc end region
  !$acc end data region
  write(*,*)fsum
end program
$ pgf90 -ta=nvidia -Minfo=accel test.f90
para:
      9, Generating local(f(:))
         Generating copy(fsum(:))
     10, Generating copyout(f_j(1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(10)
             CC 1.0 : 4 registers; 20 shared, 52 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 4 registers; 20 shared, 52 constant, 0 local memory bytes; 25 occupancy
     14, Loop is parallelizable
     15, Loop is parallelizable
         Accelerator kernel generated
         14, !$acc do parallel, vector(10)
             Cached references to size [10] block of 'f'
         15, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 64 shared, 52 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 64 shared, 52 constant, 0 local memory bytes; 100 occupancy
     19, Loop is parallelizable
         Accelerator kernel generated
         19, !$acc do parallel, vector(10)
             CC 1.0 : 8 registers; 20 shared, 48 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 20 shared, 48 constant, 0 local memory bytes; 25 occupancy
     20, Loop is parallelizable
$ a.out
110 120 130 140 150 160
170 180 190 200
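If you want to check the accelerator result against a host-only run, here is a minimal sketch of the same three-loop version with the directives removed. The program name check_fsum and the parameter n are just names picked for this illustration, and the closed form n*(i+10) is my own reading of the loops (each row of f_j holds n copies of i+10), not something the compiler reports.

```fortran
program check_fsum
  implicit none
  integer, parameter :: n = 10   ! hypothetical name; same extent as in the thread
  integer :: f(n), f_j(n,n), fsum(n)
  integer :: i, j

  ! Step 1: fill f (first kernel in the split version).
  do i = 1, n
    f(i) = i
  end do

  ! Step 2: one row of f_j per i (second kernel).
  do i = 1, n
    do j = 1, n
      f_j(i,j) = f(i) + 10
    end do
  end do

  ! Step 3: row sums (third kernel).
  do i = 1, n
    fsum(i) = sum(f_j(i,:))
  end do

  ! Each row holds n copies of i+10, so every sum should equal n*(i+10).
  do i = 1, n
    if (fsum(i) /= n*(i + 10)) then
      write(*,*) 'mismatch at i =', i, ':', fsum(i)
      stop 1
    end if
  end do

  write(*,*) fsum
end program check_fsum
```

Built without -ta=nvidia, this should print the same ten values, 110 through 200, although the line breaks in list-directed output depend on the compiler.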