This is my case: t_ptr is a pointer to a pointer to a 3-D array (t_ptr => ptr => u1), and sclr is a scalar value. Here is my code snippet, which gives a "Segmentation Fault".
#ifdef __PGI
!$acc data region copy(t_ptr) copyin(sx, sy, sz, sclr)
!$acc region do parallel
#endif
      do i = 1, 1
         t_ptr(sx, sy, sz) = t_ptr(sx, sy, sz) + sclr
      enddo
#ifdef __PGI
!$acc end region
!$acc end data region
#endif
Here are the compiler's informational messages:
13, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
32, Loop unrolled 3 times (completely unrolled)
36, Generating copyin(sclr)
Generating copyin(sz)
Generating copyin(sy)
Generating copyin(sx)
Generating copy(ptr(:,:,:))
38, Generating copy(t_ptr(sx,sy,sz))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
40, Complex loop carried dependence of 't_ptr' prevents parallelization
Loop carried dependence due to exposed use of 't_ptr(sx,sy,sz)' prevents parallelization
Accelerator kernel generated
40, !$acc do seq
CC 1.0 : 2 registers; 44 shared, 0 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 4 registers; 0 shared, 60 constant, 0 local memory bytes; 16% occupancy
First, F90 pointers are not yet supported within Accelerator regions. We need to resolve aliasing issues before this can be added, but our hope is that it's a solvable problem.
Second, you don't want to copy your scalars in the data region. This promotes them to global memory, which hurts performance. Instead, remove the copyin clause and let the compiler either privatize them or pass them in as arguments to the generated kernel.
Third, the loop is not parallel since every iteration of the loop updates the same element of t_ptr. Granted, the trip count is 1 so there’s no parallelism to begin with, but the compiler’s dependency analysis doesn’t take the trip count into account when determining independence.
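Putting those three points together, here is a hedged sketch of how the snippet might be restructured: operate on the pointer's target array directly (assumed here to be u1), drop the scalar copyin clauses, and give the loop distinct elements to update so it can actually be parallelized. The array sizes, bounds, and initialization below are made up for illustration.

program update_plane
   implicit none
   integer, parameter :: nx = 64, ny = 64, nz = 64   ! illustrative sizes
   real :: u1(nx, ny, nz)
   real :: sclr
   integer :: i, j, sz

   u1   = 0.0
   sclr = 1.5
   sz   = 3

!$acc data region copy(u1)
!$acc region
   ! Each iteration updates a distinct element of the k-plane, so the
   ! loops can be parallelized; sz and sclr are passed to the kernel
   ! directly rather than being copied in through the data region.
   do j = 1, ny
      do i = 1, nx
         u1(i, j, sz) = u1(i, j, sz) + sclr
      enddo
   enddo
!$acc end region
!$acc end data region

   print *, 'u1(1,1,sz) =', u1(1, 1, sz)
end program update_plane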
Thank you for your reply. I guess the only way left would be to access the arrays directly, and let the compiler recognize the reduction.
Please bear with me, I am new to Fortran, and after your reply I have a question regarding array processing. If there is an operation like:
A = A + B
or
A = A + (B*C)
where A, B, and C are arrays of the same shape. If such an operation occurs within a compute region, will it be moved to the device?
Is there a way for the compiler to recognize the array reduction inside the acc region and make it parallel?
Yes, the compiler will automatically detect reductions and generate optimal parallel reduction code. However, the reduction variable must be a scalar, and the reduction must be performed across the outer loop.
For example:
sum = 0
!$acc region
do k = k0, k1
   do j = j0, j1
      do i = i0, i1
         sum = sum + u(i,j,k)
      enddo
   enddo
enddo
!$acc end region
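For completeness, a minimal self-contained version of that snippet (the array size, bounds, and initialization are made up) might look like:

program reduce_sum
   implicit none
   integer, parameter :: n = 32                      ! illustrative size
   integer, parameter :: i0 = 1, i1 = n, j0 = 1, j1 = n, k0 = 1, k1 = n
   real :: u(n, n, n), sum
   integer :: i, j, k

   u = 1.0                                           ! every element adds 1.0

   sum = 0
!$acc region
   do k = k0, k1
      do j = j0, j1
         do i = i0, i1
            sum = sum + u(i,j,k)
         enddo
      enddo
   enddo
!$acc end region

   print *, 'sum =', sum                             ! expect real(n**3) = 32768.0
end program reduce_sum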
As part of our OpenACC implementation, we're working on adding support for inner loop reductions. For example:
!$acc parallel
!$acc loop gang
do k = k0, k1
   sum = 0
!$acc loop vector(32), reduction(+:sum)
   do j = j0, j1
      sum = sum + u(j,k)
   enddo
   array(k) = sum
enddo
!$acc end parallel
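Again as a hedged sketch, and assuming the inner-loop reduction support described above is available in the compiler, a self-contained program built around that pattern (the sizes and the row-sum use case are made up) could be:

program row_sums
   implicit none
   integer, parameter :: nj = 1024, nk = 256         ! illustrative sizes
   integer, parameter :: j0 = 1, j1 = nj, k0 = 1, k1 = nk
   real :: u(nj, nk), array(nk), sum
   integer :: j, k

   u = 2.0

!$acc parallel
!$acc loop gang
   do k = k0, k1
      sum = 0
!$acc loop vector(32), reduction(+:sum)
      do j = j0, j1
         sum = sum + u(j,k)
      enddo
      array(k) = sum                                 ! one row sum per k
   enddo
!$acc end parallel

   print *, 'array(1) =', array(1)                   ! expect nj*2.0 = 2048.0
end program row_sums

Here each gang handles one k iteration, the 32-wide vector loop reduces its row into the gang-local sum, and the per-row result is written back to array(k).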