Need help to accelerate

Can anybody help me accelerate this region:

   168  !$acc data copy(CG) copyout(U,STRESS) copyin(PI,B,N1,N2,NG3,NG2) 
   169  !$acc parallel
   170  !$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
   171        do J3 = J3min,J3max
   172          if (J3.gt.NG3/2) then
   173            I3 = J3 - NG3
   174          else
   175            I3 = J3
   176          endif
   177  !$acc loop
   178          do J2 = J2min,J2max
   179            if (J2.gt.NG2/2) then
   180              I2 = J2 - NG2
   181            else
   182              I2 = J2
   183            endif
   184  !$acc loop private(g) reduction(CG)
   185            do J1 = 0,N1-1
   186              if (J1.gt.N1/2) then
   187                I1 = J1 - N1
   188              else
   189                I1 = J1
   190              endif
   191              G(1)= B(1,1) * I1 + B(1,2) * I2 + B(1,3) * I3
   192              G(2)= B(2,1) * I1 + B(2,2) * I2 + B(2,3) * I3
   193              G(3)= B(3,1) * I1 + B(3,2) * I2 + B(3,3) * I3
   194              G2 = G(1)**2 + G(2)**2 + G(3)**2
   195              J2L = J2 - J2min 
   196              J3L = J3 - J3min 
   197              J = 1 + J1 + N1 * J2L + N1 * N2 * J3L
   198              if (G2.LT.G2MAX .AND. G2.GT.TINY) then
   199                VG = 8.0_dp * PI / G2
   200                DU = VG * ( CG(1,J)**2 + CG(2,J)**2 )
   201                U = U + DU
   202                C = 2.0_dp * DU / G2
   203                
   204                 s11 = s11 + C * G(1) * G(1)
   205                 s21 = s21 + C * G(1) * G(2)
   206                 s31 = s31 + C * G(1) * G(3)
   207                 
   208                 s12 = s12 + C * G(2) * G(1)
   209                 s22 = s22 + C * G(2) * G(2)
   210                 s32 = s32 + C * G(2) * G(3)
   211                 
   212                 s13 = s13 + C * G(3) * G(1)
   213                 s23 = s23 + C * G(3) * G(2)
   214                 s33 = s33 + C * G(3) * G(3)
   215
   216  !              DO IX = 1,3
   217  !                DO JX = 1,3
   218  !                  STRESS(JX,IX) = STRESS(JX,IX) + C * G(IX) * G(JX)
   219  !                ENDDO
   220  !              ENDDO
   221
   222                CG(1,J) = VG * CG(1,J)
   223                CG(2,J) = VG * CG(2,J)
   224              else
   225                CG(1,J) = 0.0_dp
   226                CG(2,J) = 0.0_dp
   227              endif
   228            enddo
   229  !$end loop
   230          enddo
   231  !$end loop
   232        enddo
   233  !$end loop
   234  !$acc end parallel
   235  !$acc end data

When I compile with

 pgfortran -V12.8 -c -g -acc -ta=nvidia:4.2 -Minfo

I get:

168, Generating copyout(stress(:,:))
         Generating copyout(u)
         Generating copyin(ng2)
         Generating copyin(ng3)
         Generating copyin(n2)
         Generating copyin(n1)
         Generating copyin(b(:,:))
         Generating copyin(pi)
         Generating copy(cg(:,:))
    169, Accelerator kernel generated
        169, CC 1.3 : 64 registers; 48 shared, 252 constant, 8 local memory bytes
             CC 2.0 : 63 registers; 0 shared, 312 constant, 0 local memory bytes
        171, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    169, Generating copy(cg(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    178, Loop carried reuse of 'cg' prevents parallelization
    185, Loop carried reuse of 'cg' prevents parallelization
         Complex loop carried dependence of 'cg' prevents parallelization

There is a reported dependence around CG; how can I work around this problem?

Hi ID#cat,

First, reduction variables must be scalars, so CG can’t be used in a reduction clause. However, CG isn’t actually being reduced; rather, the compiler is complaining that it can’t tell that all indices of CG are independent, because of the computed J index.
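
To illustrate the scalar-reduction rule with a standalone toy (not taken from the original code; the names U and CG are reused only for familiarity), under the OpenACC rules these PGI releases implement a reduction clause may only name scalar variables:

```fortran
! Toy example: reduction clauses accept scalars, not arrays.
program reduction_demo
  implicit none
  integer :: i
  real :: U
  real :: CG(100)
  CG = 1.0
  U = 0.0
!$acc parallel loop reduction(+:U)   ! valid: U is a scalar
  do i = 1, 100
    U = U + CG(i)
  end do
!$acc end parallel loop
! !$acc loop reduction(+:CG)         ! invalid: CG is an array
  print *, U
end program reduction_demo
```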

The major reason the inner loops aren’t parallelizing is that you have scalar code in between each loop. Try pushing the J3 and J2 if statements inside the J1 loop. In some cases when using “kernels” the compiler may perform this transformation automatically, but it won’t when using “parallel”.

You may also try using “kernels” instead of “parallel”; just add the “independent” clause to your loop directives to work around CG’s computed-index issue.

Hope this helps,
Mat

Thank you. I have changed the initialization of the loops:

169	!$acc data copy(CG) copyout(U,s11,s21,s31,s12,s22,s32,s13,s23,s33) &
   170	!$acc& copyin(NG3,NG2,N1,N2,PI,B,G2MAX,TINY) 
   171	!$acc kernels 
   172	!$acc loop independent 
   173	      do J3 = J3min,J3max
   174	!$acc loop independent
   175	        do J2 = J2min,J2max          
   176	!$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U) &
   177	!$acc& private(G)
   178	          do J1 = 0,N1-1
   179	            if (J2.gt.NG2/2) then
   180	              I2 = J2 - NG2
   181	            else
   182	              I2 = J2
   183	            endif
   184	            
   185	            if (J3.gt.NG3/2) then
   186	              I3 = J3 - NG3
   187	            else
   188	              I3 = J3
   189	            endif
   190	            
   191	            if (J1.gt.N1/2) then
   192	              I1 = J1 - N1
   193	            else
   194	              I1 = J1
   195	            endif

It compiles with:

169, Generating copyout(s33)
         Generating copyout(s23)
         Generating copyout(s13)
         Generating copyout(s32)
         Generating copyout(s22)
         Generating copyout(s12)
         Generating copyout(s31)
         Generating copyout(s21)
         Generating copyout(s11)
         Generating copyout(u)
         Generating copyin(tiny)
         Generating copyin(g2max)
         Generating copyin(b(:,:))
         Generating copyin(pi)
         Generating copyin(n2)
         Generating copyin(n1)
         Generating copyin(ng2)
         Generating copyin(ng3)
         Generating copy(cg(:,:))
    171, Generating present_or_copy(cg(:,:))
         Generating present_or_copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    173, Loop is parallelizable
    175, Loop is parallelizable
         Accelerator kernel generated
        173, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
        175, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
             CC 1.3 : 46 registers; 264 shared, 40 constant, 0 local memory bytes
             CC 2.0 : 62 registers; 0 shared, 296 constant, 0 local memory bytes
    178, Loop carried reuse of 'cg' prevents parallelization
         Complex loop carried dependence of 'cg' prevents parallelization
         Inner sequential loop scheduled on accelerator

but I get a run-time error:

0: ALLOCATE: 18446744071562067970 bytes requested; not enough memory
make: *** [completed_work] Error 127

Most likely a bogus value is being used when allocating device data. Double-check that the arrays are all allocated on the host before being passed to the device.
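
A quick sanity check along these lines (a sketch only, assuming CG is an allocatable host array; adapt the name to your code) might look like:

```fortran
! Sketch: verify host allocation before entering the data region.
! "CG" stands in for whichever allocatable array feeds the device.
if (.not. allocated(CG)) then
  stop 'CG is not allocated before the !$acc data region'
end if
print *, 'CG elements on host:', size(CG)
```

If the printed size is zero or wildly large, the bogus device allocation request is coming from your program rather than the compiler.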

If that’s not it, I’d need to see the full source to determine whether it’s an error in your program or a compiler error. Can you please post a reproducing example or send one to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat