loop is parallelizable

WENYANG_LIU · October 19, 2010, 6:44pm

Hi everyone,

I tried compiling my code with pgfortran v10.3 and v10.9.
The compilation message always contains:

Loop is parallelizable

But the corresponding loops are not actually parallelized.
Is this a problem with my code or the compiler?

MatColgrove · October 19, 2010, 11:03pm

Sorry, I’ll need more information. What other informational messages are printed? Can you post an example?

Mat

WENYANG_LIU · October 20, 2010, 12:06am

Hi Mat,

Here is an example:

      program para

      implicit none

      integer::f(10),f_j(10),fsum(10)
      integer::i,j
      fsum=0
!$acc data region local(f),copy(fsum)
!$acc region
!$acc do private(f_j)
      do i=1,10
         f(i)=i
         do j=1,10
            f_j(j)=f(i)+10
         end do
         fsum(i)=sum(f_j)
      end do
!$acc end region
!$acc end data region

      write(*,*)fsum

      end program

para:
      8, Generating local(f(:))
         Generating copy(fsum(:))
      9, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(10)
             CC 1.0 : 14 registers; 20 shared, 32 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 14 registers; 20 shared, 32 constant, 0 local memory bytes; 25 occupancy
     13, Loop is parallelizable
     16, sum reduction inlined
         Loop is parallelizable

I added “do private” to the i loop because the compiler reported that parallelizing it requires privatization of array ‘f_j(1:10)’.

Thanks.

MatColgrove · October 20, 2010, 5:02pm

Hi WENYANG LIU,

In this case, while the “j” loop and the sum reduction are parallelizable, the compiler has determined that the optimal schedule is to create a single kernel for the “i” loop, with the entire loop body (including the “j” loop and the sum) forming the body of that kernel. That is why the inner loops are reported as parallelizable but get no schedule of their own. The only alternative schedule would be to break the loop into three kernels, in which case you’ll need to rewrite the code a bit:

$ cat test.f90 

      program para

      implicit none

      integer::f(10),f_j(10,10),fsum(10)
      integer::i,j
      fsum=0
!$acc data region local(f),copy(fsum)
!$acc region
      do i=1,10
         f(i)=i
      enddo
      do i=1,10
         do j=1,10
            f_j(i,j)=f(i)+10
         end do
      enddo
      do i=1,10
         fsum(i)=sum(f_j(i,:))
      end do
!$acc end region
!$acc end data region

      write(*,*)fsum

      end program


$ pgf90 -ta=nvidia -Minfo=accel test.f90 
para:
      9, Generating local(f(:))
         Generating copy(fsum(:))
     10, Generating copyout(f_j(1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(10)
             CC 1.0 : 4 registers; 20 shared, 52 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 4 registers; 20 shared, 52 constant, 0 local memory bytes; 25 occupancy
     14, Loop is parallelizable
     15, Loop is parallelizable
         Accelerator kernel generated
         14, !$acc do parallel, vector(10)
             Cached references to size [10] block of 'f'
         15, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 64 shared, 52 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 64 shared, 52 constant, 0 local memory bytes; 100 occupancy
     19, Loop is parallelizable
         Accelerator kernel generated
         19, !$acc do parallel, vector(10)
             CC 1.0 : 8 registers; 20 shared, 48 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 20 shared, 48 constant, 0 local memory bytes; 25 occupancy
     20, Loop is parallelizable
$ a.out
          110          120          130          140          150          160 
          170          180          190          200
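
As a sanity check on those values: since f(i)=i and every one of the 10 elements summed for a given i equals f(i)+10, each result should be fsum(i) = 10*(i+10), i.e. 110 through 200. A minimal host-only sketch (no accelerator directives; the program name check_fsum and the expected array are just illustrative, not part of the original code) that reproduces the same arithmetic might look like this:

```fortran
program check_fsum
  implicit none
  integer :: f_j(10), fsum(10), expected(10)
  integer :: i, j

  do i = 1, 10
     do j = 1, 10
        ! Same value the accelerator version stores: f(i)+10 with f(i)=i
        f_j(j) = i + 10
     end do
     fsum(i) = sum(f_j)
     ! Closed form: ten copies of (i+10)
     expected(i) = 10*(i + 10)
  end do

  write(*,*) fsum
  if (all(fsum == expected)) then
     write(*,*) 'matches 10*(i+10)'
  else
     write(*,*) 'mismatch'
  end if
end program check_fsum
```

It should build with any Fortran compiler without the -ta flag, which makes it a quick way to compare host results against the accelerator output.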

Mat