Reduction not recognized in Fortran

Hello,

Here is my case: t_ptr is a pointer to a pointer to a 3-D array (t_ptr => ptr => u1), and sclr is a scalar value. The following code snippet gives a “Segmentation Fault”.

       #ifdef __PGI
       !$acc data region copy(t_ptr) copyin(sx, sy, sz, sclr)
       !$acc region do parallel
       #endif
       do i=1,1
          t_ptr(sx, sy, sz) = t_ptr(sx, sy, sz) + sclr
       enddo
       #ifdef __PGI
       !$acc end region
       !$acc end data region
       #endif

The informational messages are as follows:

     13, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
     32, Loop unrolled 3 times (completely unrolled)
     36, Generating copyin(sclr)
         Generating copyin(sz)
         Generating copyin(sy)
         Generating copyin(sx)
         Generating copy(ptr(:,:,:))
     38, Generating copy(t_ptr(sx,sy,sz))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     40, Complex loop carried dependence of 't_ptr' prevents parallelization
         Loop carried dependence due to exposed use of 't_ptr(sx,sy,sz)' prevents parallelization
         Accelerator kernel generated
         40, !$acc do seq
         CC 1.0 : 2 registers; 44 shared, 0 constant, 0 local memory bytes; 33% occupancy
         CC 2.0 : 4 registers; 0 shared, 60 constant, 0 local memory bytes; 16% occupancy

How can I solve this error?

Thanks
Sayan

Hi Sayan,

There are a couple of issues here.

First, F90 pointers are not yet supported within Accelerator regions. We need to address aliasing issues before this can be added, but our hope is that it’s a solvable problem.

Second, you don’t want to copy your scalars in the data region. This promotes them to global memory, which hurts performance. Instead, remove the copyin clause and let the compiler either privatize them or pass them in as arguments to the generated kernel.

Third, the loop is not parallel since every iteration of the loop updates the same element of t_ptr. Granted, the trip count is 1 so there’s no parallelism to begin with, but the compiler’s dependency analysis doesn’t take the trip count into account when determining independence.
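To illustrate the first two points, here is a minimal sketch that works on a plain array instead of a pointer chain and leaves the scalar out of the data clauses. It also updates a distinct element in each iteration, so there is no loop-carried dependence. The program name, the extent n, and the initial values are made up for illustration; the directives follow the PGI Accelerator syntax used in this thread, and a compiler without that support should treat them as ordinary comments and run the loop on the host.

```fortran
program scalar_args_sketch
  implicit none
  integer, parameter :: n = 8      ! hypothetical extent
  real :: u(n, n, n)               ! plain array, no F90 pointers
  real :: sclr
  integer :: i, j, k

  u = 1.0
  sclr = 2.5

  ! Only the array appears in a data clause. The scalar sclr is
  ! deliberately left out so the compiler can privatize it or pass
  ! it as a kernel argument.
  !$acc data region copy(u)
  !$acc region
  do k = 1, n
    do j = 1, n
      do i = 1, n
        ! Each iteration writes a different element of u.
        u(i, j, k) = u(i, j, k) + sclr
      end do
    end do
  end do
  !$acc end region
  !$acc end data region

  ! Self-check: every element should now hold 1.0 + 2.5.
  if (any(abs(u - 3.5) > 1.0e-6)) error stop "unexpected result"
  print *, "u(1,1,1) =", u(1, 1, 1)
end program scalar_args_sketch
```

With the PGI compiler, the -Minfo=accel messages should show whether the loop nest was parallelized.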

  • Mat

Thank you for your reply. I guess the only way left would be to access the arrays directly, and let the compiler recognize the reduction.

Please bear with me: I am new to Fortran, and after your reply I have a question about array processing. Suppose there is an operation like:

A = A + B
or
A = A + (B*C)

where A, B and C are arrays with the same shape. If such an operation occurs within a compute region, would it be moved to the device?

Thank you
Sayan

Hi Sayan,

Yes, array syntax is supported. So the following would accelerate:

!$acc region
A = A + B 
!$acc end region

Array syntax gets expanded by the compiler into an implied DO loop and then accelerated after the expansion.
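As a rough illustration of that expansion, the sketch below applies the array-syntax form inside a region and compares it with a hand-written loop nest that does the same element-wise work. The program name, array sizes, and values are assumptions for illustration, and the explicit loops only approximate what the compiler generates internally.

```fortran
program array_syntax_sketch
  implicit none
  integer, parameter :: n = 4      ! hypothetical extent
  real :: a(n, n), b(n, n), c(n, n)
  real :: a_loop(n, n)             ! reference result from explicit loops
  integer :: i, j

  a = 1.0
  b = 2.0
  c = 3.0
  a_loop = a

  ! Array-syntax form, as it would appear in a compute region.
  !$acc region
  a = a + (b * c)
  !$acc end region

  ! Roughly equivalent explicit loop nest, run on the host.
  do j = 1, n
    do i = 1, n
      a_loop(i, j) = a_loop(i, j) + b(i, j) * c(i, j)
    end do
  end do

  ! Self-check: both forms should agree element by element.
  if (any(abs(a - a_loop) > 1.0e-6)) error stop "forms disagree"
  print *, "a(1,1) =", a(1, 1)
end program array_syntax_sketch
```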

  • Mat

Hello Mat,

Thank you once again. Referring to my original question, I have removed the pointers and I use an array, like this:

  !$acc region
  ....some other code
  ....some other code
  do k=k0,k1
    do j=j0,j1
      do i=i0,i1
        u(i,j,k)=u(i,j,k)+k
      enddo
    enddo
  enddo
  !$acc end region

I use the following compiler optimization options:

-Mpreprocess -fastsse -Mvect=noaltcode -Mipa=fast -mp=numa -ta=nvidia,host -Minfo=accel,loop,opt -Mneginfo

Is there a way that the compiler would be able to recognize the array reduction inside the acc region and make it parallel?

Thank you
Sayan

Hi Sayan,

Is there a way that the compiler would be able to recognize the array reduction inside the acc region and make it parallel?

Yes, the compiler will automatically detect reductions and generate optimal parallel reduction code. However, the reduction variable must be a scalar, and the reduction must be performed across the outer loop.

For example:

  sum = 0
  !$acc region
  do k=k0,k1
    do j=j0,j1
      do i=i0,i1
        sum = sum + u(i,j,k)
      enddo
    enddo
  enddo
  !$acc end region
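A self-contained version of this scalar reduction might look like the sketch below. The program name, extent, and fill value are assumptions, and the accumulator is called total rather than sum only to avoid shadowing the SUM intrinsic. Without accelerator support the directives are comments and the loop simply runs on the host.

```fortran
program scalar_reduction_sketch
  implicit none
  integer, parameter :: n = 16     ! hypothetical extent
  real :: u(n, n, n)
  real :: total                    ! scalar reduction variable
  integer :: i, j, k

  u = 0.5
  total = 0.0

  ! The compiler is expected to recognize the accumulation into the
  ! scalar as a sum reduction across the whole loop nest.
  !$acc region
  do k = 1, n
    do j = 1, n
      do i = 1, n
        total = total + u(i, j, k)
      end do
    end do
  end do
  !$acc end region

  ! Self-check against the closed-form value 0.5 * n**3.
  if (abs(total - 0.5 * real(n)**3) > 1.0e-3) error stop "bad sum"
  print *, "total =", total
end program scalar_reduction_sketch
```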

As part of our OpenACC implementation, we’re working on adding support for inner loop reductions. For example:

  !$acc parallel
  !$acc loop gang
  do k=k0,k1
    sum = 0
    !$acc loop vector(32), reduction(+:sum)
    do j=j0,j1
      sum = sum + u(j,k)
    enddo
    array(k) = sum
  enddo
  !$acc end parallel
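
For readers who want to try the inner-loop pattern, here is a hedged, self-contained sketch in standard OpenACC syntax. It is not the snippet above verbatim: the names col_sum and s, the array sizes, the data clauses, and the explicit private(s) are assumptions added so the program stands on its own, and the vector length is left to the compiler. Whether a given compiler version accepts and parallelizes the inner reduction may vary.

```fortran
program inner_reduction_sketch
  implicit none
  integer, parameter :: nj = 64, nk = 8   ! hypothetical extents
  real :: u(nj, nk)
  real :: col_sum(nk)                     ! one partial sum per k
  real :: s                               ! per-gang accumulator
  integer :: j, k

  ! Fill column k with the value k so the expected sum is nj * k.
  do k = 1, nk
    u(:, k) = real(k)
  end do

  !$acc parallel loop gang private(s) copyin(u) copyout(col_sum)
  do k = 1, nk
    s = 0.0
    ! Inner loop reduction across the vector lanes of each gang.
    !$acc loop vector reduction(+:s)
    do j = 1, nj
      s = s + u(j, k)
    end do
    col_sum(k) = s
  end do
  !$acc end parallel loop

  ! Self-check: each column sum should equal nj * k.
  do k = 1, nk
    if (abs(col_sum(k) - real(nj * k)) > 1.0e-3) error stop "bad column sum"
  end do
  print *, "col_sum(1) =", col_sum(1)
end program inner_reduction_sketch
```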
  • Mat

Thanks, great info.