Can anybody help me accelerate this region?
168 !$acc data copy(CG) copyout(U,STRESS) copyin(PI,B,N1,N2,NG3,NG2)
169 !$acc parallel
170 !$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
171 do J3 = J3min,J3max
172 if (J3.gt.NG3/2) then
173 I3 = J3 - NG3
174 else
175 I3 = J3
176 endif
177 !$acc loop
178 do J2 = J2min,J2max
179 if (J2.gt.NG2/2) then
180 I2 = J2 - NG2
181 else
182 I2 = J2
183 endif
184 !$acc loop private(g) reduction(CG)
185 do J1 = 0,N1-1
186 if (J1.gt.N1/2) then
187 I1 = J1 - N1
188 else
189 I1 = J1
190 endif
191 G(1)= B(1,1) * I1 + B(1,2) * I2 + B(1,3) * I3
192 G(2)= B(2,1) * I1 + B(2,2) * I2 + B(2,3) * I3
193 G(3)= B(3,1) * I1 + B(3,2) * I2 + B(3,3) * I3
194 G2 = G(1)**2 + G(2)**2 + G(3)**2
195 J2L = J2 - J2min
196 J3L = J3 - J3min
197 J = 1 + J1 + N1 * J2L + N1 * N2 * J3L
198 if (G2.LT.G2MAX .AND. G2.GT.TINY) then
199 VG = 8.0_dp * PI / G2
200 DU = VG * ( CG(1,J)**2 + CG(2,J)**2 )
201 U = U + DU
202 C = 2.0_dp * DU / G2
203
204 s11 = s11 + C * G(1) * G(1)
205 s21 = s21 + C * G(1) * G(2)
206 s31 = s31 + C * G(1) * G(3)
207
208 s12 = s12 + C * G(2) * G(1)
209 s22 = s22 + C * G(2) * G(2)
210 s32 = s32 + C * G(2) * G(3)
211
212 s13 = s13 + C * G(3) * G(1)
213 s23 = s23 + C * G(3) * G(2)
214 s33 = s33 + C * G(3) * G(3)
215
216 ! DO IX = 1,3
217 ! DO JX = 1,3
218 ! STRESS(JX,IX) = STRESS(JX,IX) + C * G(IX) * G(JX)
219 ! ENDDO
220 ! ENDDO
221
222 CG(1,J) = VG * CG(1,J)
223 CG(2,J) = VG * CG(2,J)
224 else
225 CG(1,J) = 0.0_dp
226 CG(2,J) = 0.0_dp
227 endif
228 enddo
229 !$end loop
230 enddo
231 !$end loop
232 enddo
233 !$end loop
234 !$acc end parallel
235 !$acc end data
When I compile with
pgfortran -V12.8 -c -g -acc -ta=nvidia:4.2 -Minfo
I get:
168, Generating copyout(stress(:,:))
Generating copyout(u)
Generating copyin(ng2)
Generating copyin(ng3)
Generating copyin(n2)
Generating copyin(n1)
Generating copyin(b(:,:))
Generating copyin(pi)
Generating copy(cg(:,:))
169, Accelerator kernel generated
169, CC 1.3 : 64 registers; 48 shared, 252 constant, 8 local memory bytes
CC 2.0 : 63 registers; 0 shared, 312 constant, 0 local memory bytes
171, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
169, Generating copy(cg(:,:))
Generating copyin(b(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
178, Loop carried reuse of 'cg' prevents parallelization
185, Loop carried reuse of 'cg' prevents parallelization
Complex loop carried dependence of 'cg' prevents parallelization
The compiler seems to see a reduction or dependence on CG. How can I work around this problem?
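
One possible restructuring, offered as a sketch and not as a verified fix for this compiler version: as far as I can tell, CG is not a reduction variable here, because each (J1,J2,J3) triple writes its own element J, provided J2max-J2min+1 does not exceed N2. The sketch below therefore drops reduction(CG), keeps the reduction only on the scalars, moves the I2/I3 computation into the innermost loop so the three loops are tightly nested, and fills STRESS from the scalars after the region. All sizes, the identity matrix B, the initial CG values, and G2MAX/TINY are hypothetical values chosen only so the program runs on its own. The scalars gx, gy, gz replace the private array G, and collapse(3) is assumed to be supported by the compiler in use. It is written in free-form source; the directives are ignored when compiled without -acc.

```fortran
program cg_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes and constants, chosen only so the sketch runs.
  integer, parameter :: N1 = 8, N2 = 8, NG2 = 8, NG3 = 8
  integer, parameter :: J2min = 0, J2max = N2 - 1, J3min = 0, J3max = 7
  real(dp), parameter :: PI = 3.141592653589793_dp
  real(dp), parameter :: G2MAX = 30.0_dp, TINY = 1.0e-12_dp
  real(dp) :: B(3,3), CG(2, N1*N2*(J3max - J3min + 1)), STRESS(3,3)
  real(dp) :: U, VG, DU, C, G2, gx, gy, gz
  real(dp) :: s11, s21, s31, s12, s22, s32, s13, s23, s33
  integer :: J1, J2, J3, I1, I2, I3, J

  ! Hypothetical input data: identity reciprocal basis, constant coefficients.
  B = 0.0_dp
  B(1,1) = 1.0_dp; B(2,2) = 1.0_dp; B(3,3) = 1.0_dp
  CG = 1.0_dp

  ! Reduction variables must be initialised before the region.
  U = 0.0_dp
  s11 = 0.0_dp; s21 = 0.0_dp; s31 = 0.0_dp
  s12 = 0.0_dp; s22 = 0.0_dp; s32 = 0.0_dp
  s13 = 0.0_dp; s23 = 0.0_dp; s33 = 0.0_dp

  ! Only the arrays are listed in data clauses; U and the s** scalars are
  ! handled by the reduction clause, not by copyout.
  !$acc data copy(CG) copyin(B)
  !$acc parallel loop collapse(3) &
  !$acc   private(I1, I2, I3, J, gx, gy, gz, G2, VG, DU, C) &
  !$acc   reduction(+:U, s11, s21, s31, s12, s22, s32, s13, s23, s33)
  do J3 = J3min, J3max
    do J2 = J2min, J2max
      do J1 = 0, N1 - 1
        ! Index wrapping done inside the innermost loop so that the
        ! three loops stay tightly nested for collapse(3).
        I3 = merge(J3 - NG3, J3, J3 > NG3/2)
        I2 = merge(J2 - NG2, J2, J2 > NG2/2)
        I1 = merge(J1 - N1,  J1, J1 > N1/2)
        gx = B(1,1)*I1 + B(1,2)*I2 + B(1,3)*I3
        gy = B(2,1)*I1 + B(2,2)*I2 + B(2,3)*I3
        gz = B(3,1)*I1 + B(3,2)*I2 + B(3,3)*I3
        G2 = gx**2 + gy**2 + gz**2
        ! Each iteration owns a distinct J, so CG needs no reduction.
        J = 1 + J1 + N1*(J2 - J2min) + N1*N2*(J3 - J3min)
        if (G2 < G2MAX .and. G2 > TINY) then
          VG = 8.0_dp * PI / G2
          DU = VG * (CG(1,J)**2 + CG(2,J)**2)
          U  = U + DU
          C  = 2.0_dp * DU / G2
          s11 = s11 + C*gx*gx; s21 = s21 + C*gx*gy; s31 = s31 + C*gx*gz
          s12 = s12 + C*gy*gx; s22 = s22 + C*gy*gy; s32 = s32 + C*gy*gz
          s13 = s13 + C*gz*gx; s23 = s23 + C*gz*gy; s33 = s33 + C*gz*gz
          CG(1,J) = VG * CG(1,J)
          CG(2,J) = VG * CG(2,J)
        else
          CG(1,J) = 0.0_dp
          CG(2,J) = 0.0_dp
        end if
      end do
    end do
  end do
  !$acc end parallel loop
  !$acc end data

  ! Assemble STRESS on the host from the reduced scalars (column-major order).
  STRESS = reshape([s11, s21, s31, s12, s22, s32, s13, s23, s33], [3, 3])

  print *, 'U =', U
  print *, 'STRESS diagonal =', STRESS(1,1), STRESS(2,2), STRESS(3,3)
end program cg_sketch
```

If collapse(3) is not accepted, the same idea should carry over to a single !$acc loop on the J3 loop with the reduction clause, leaving the inner loops sequential. The essential points are that CG is left out of any reduction clause and that U does not appear both in copyout and in a reduction.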