# Need help to accelerate

Can anybody help me accelerate this region?

``````
168  !$acc data copy(CG) copyout(U,STRESS) copyin(PI,B,N1,N2,NG3,NG2)
169  !$acc parallel
170  !$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
171        do J3 = J3min,J3max
172          if (J3.gt.NG3/2) then
173            I3 = J3 - NG3
174          else
175            I3 = J3
176          endif
177  !$acc loop
178          do J2 = J2min,J2max
179            if (J2.gt.NG2/2) then
180              I2 = J2 - NG2
181            else
182              I2 = J2
183            endif
184  !$acc loop private(g) reduction(CG)
185            do J1 = 0,N1-1
186              if (J1.gt.N1/2) then
187                I1 = J1 - N1
188              else
189                I1 = J1
190              endif
191              G(1)= B(1,1) * I1 + B(1,2) * I2 + B(1,3) * I3
192              G(2)= B(2,1) * I1 + B(2,2) * I2 + B(2,3) * I3
193              G(3)= B(3,1) * I1 + B(3,2) * I2 + B(3,3) * I3
194              G2 = G(1)**2 + G(2)**2 + G(3)**2
195              J2L = J2 - J2min
196              J3L = J3 - J3min
197              J = 1 + J1 + N1 * J2L + N1 * N2 * J3L
198              if (G2.LT.G2MAX .AND. G2.GT.TINY) then
199                VG = 8.0_dp * PI / G2
200                DU = VG * ( CG(1,J)**2 + CG(2,J)**2 )
201                U = U + DU
202                C = 2.0_dp * DU / G2
203
204                 s11 = s11 + C * G(1) * G(1)
205                 s21 = s21 + C * G(1) * G(2)
206                 s31 = s31 + C * G(1) * G(3)
207
208                 s12 = s12 + C * G(2) * G(1)
209                 s22 = s22 + C * G(2) * G(2)
210                 s32 = s32 + C * G(2) * G(3)
211
212                 s13 = s13 + C * G(3) * G(1)
213                 s23 = s23 + C * G(3) * G(2)
214                 s33 = s33 + C * G(3) * G(3)
215
216  !              DO IX = 1,3
217  !                DO JX = 1,3
218  !                  STRESS(JX,IX) = STRESS(JX,IX) + C * G(IX) * G(JX)
219  !                ENDDO
220  !              ENDDO
221
222                CG(1,J) = VG * CG(1,J)
223                CG(2,J) = VG * CG(2,J)
224              else
225                CG(1,J) = 0.0_dp
226                CG(2,J) = 0.0_dp
227              endif
228            enddo
229  !$end loop
230          enddo
231  !$end loop
232        enddo
233  !$end loop
234  !$acc end parallel
235  !$acc end data
``````

When I compile with

``````
pgfortran -V12.8 -c -g -acc -ta=nvidia:4.2 -Minfo
``````

I get:

``````
168, Generating copyout(stress(:,:))
Generating copyout(u)
Generating copyin(ng2)
Generating copyin(ng3)
Generating copyin(n2)
Generating copyin(n1)
Generating copyin(b(:,:))
Generating copyin(pi)
Generating copy(cg(:,:))
169, Accelerator kernel generated
169, CC 1.3 : 64 registers; 48 shared, 252 constant, 8 local memory bytes
CC 2.0 : 63 registers; 0 shared, 312 constant, 0 local memory bytes
171, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
169, Generating copy(cg(:,:))
Generating copyin(b(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
178, Loop carried reuse of 'cg' prevents parallelization
185, Loop carried reuse of 'cg' prevents parallelization
Complex loop carried dependence of 'cg' prevents parallelization
``````

The problem seems to be the reduction on CG. How can I work around it?

Hi ID#cat,

First, reduction variables must be scalars, so CG can’t be used in a reduction clause. However, CG isn’t actually being reduced. Rather, the compiler is complaining that it can’t tell whether all indices of CG are independent, because the index J is computed.
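To illustrate the computed-index point, here is a minimal sketch (not from the original code) that checks on the host whether the linearised index `J = 1 + J1 + N1*J2L + N1*N2*J3L` is distinct for every loop iteration. The module and function names are hypothetical, and the sketch assumes `J1` runs over `0..N1-1` and `J2L` over `0..N2-1`. If every iteration touches its own `J`, there is no real dependence on CG, even though the compiler cannot prove that by itself.

```fortran
module linear_index_demo
  implicit none
contains

  ! Linearised index as used in the thread: J = 1 + J1 + N1*J2L + N1*N2*J3L
  pure integer function linear_index(j1, j2l, j3l, n1, n2) result(j)
    integer, intent(in) :: j1, j2l, j3l, n1, n2
    j = 1 + j1 + n1 * j2l + n1 * n2 * j3l
  end function linear_index

  ! Returns .true. if every (j1, j2l, j3l) triple maps to a distinct J
  ! inside 1..n1*n2*n3 (assumed ranges: 0..n1-1, 0..n2-1, 0..n3-1).
  logical function indices_are_unique(n1, n2, n3) result(ok)
    integer, intent(in) :: n1, n2, n3
    integer :: j1, j2l, j3l, j
    logical, allocatable :: seen(:)

    allocate(seen(n1 * n2 * n3))
    seen = .false.
    ok = .true.
    do j3l = 0, n3 - 1
      do j2l = 0, n2 - 1
        do j1 = 0, n1 - 1
          j = linear_index(j1, j2l, j3l, n1, n2)
          ! Bounds test first: Fortran does not guarantee short-circuiting.
          if (j < 1 .or. j > size(seen)) then
            ok = .false.
            return
          end if
          if (seen(j)) then
            ok = .false.   ! two iterations would write the same element
            return
          end if
          seen(j) = .true.
        end do
      end do
    end do
  end function indices_are_unique

end module linear_index_demo
```

Calling `indices_are_unique(N1, N2, J3max - J3min + 1)` from a small driver is one way to convince yourself that telling the compiler the loops are independent is safe for your index ranges.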

The major reason why the inner loops aren’t parallelizing is that you have scalar code in between the loops. Try pushing the J3 and J2 if statements inside the J1 loop. In some cases the compiler may perform this transformation automatically when using “kernels”, but it won’t when using “parallel”.

You may also try using “kernels” instead of “parallel”; just add the “independent” clause to your loop directives to work around CG’s computed-index issue.
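As a rough sketch of both suggestions together (tightly nested loops with the index wrapping moved into the innermost loop, plus “kernels” with “independent”), something like the following could serve as a starting point. It is a simplified stand-in, not the original routine: it assumes B is the identity so that G2 = I1² + I2² + I3², drops the 8π/G2 factors and the stress terms, and the names `nested_loop_demo`, `scale_and_sum` and `total` are hypothetical. Without `-acc` the directives are just comments, so it also runs serially on the host.

```fortran
module nested_loop_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Scales cg in place and accumulates a scalar sum over all grid points.
  subroutine scale_and_sum(cg, n1, n2, n3, total)
    integer, intent(in) :: n1, n2, n3
    real(dp), intent(inout) :: cg(2, n1 * n2 * n3)
    real(dp), intent(out) :: total
    integer :: j1, j2, j3, i1, i2, i3, j
    real(dp) :: g2

    total = 0.0_dp
    !$acc kernels copy(cg)
    !$acc loop independent reduction(+:total)
    do j3 = 0, n3 - 1
      !$acc loop independent reduction(+:total)
      do j2 = 0, n2 - 1
        !$acc loop independent private(i1, i2, i3, j, g2) reduction(+:total)
        do j1 = 0, n1 - 1
          ! All index wrapping happens here, so no scalar code sits
          ! between the loop headers and the nest stays tight.
          i3 = j3
          if (j3 > n3 / 2) i3 = j3 - n3
          i2 = j2
          if (j2 > n2 / 2) i2 = j2 - n2
          i1 = j1
          if (j1 > n1 / 2) i1 = j1 - n1

          g2 = real(i1 * i1 + i2 * i2 + i3 * i3, dp)  ! assumes B = identity
          j = 1 + j1 + n1 * j2 + n1 * n2 * j3         ! distinct per iteration

          if (g2 > 0.0_dp) then
            total = total + (cg(1, j)**2 + cg(2, j)**2) / g2  ! scalar reduction
            cg(1, j) = cg(1, j) / g2
            cg(2, j) = cg(2, j) / g2
          else
            cg(1, j) = 0.0_dp
            cg(2, j) = 0.0_dp
          end if
        end do
      end do
    end do
    !$acc end kernels
  end subroutine scale_and_sum

end module nested_loop_demo
```

Only the scalar `total` appears in the reduction clauses; CG is written at a distinct `J` on each iteration, which is what the “independent” clause asserts to the compiler. How a given compiler version schedules the three loops may differ, so check the `-Minfo` output.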

Hope this helps,
Mat

Thank you. I have changed the loop setup:

``````
169	!$acc data copy(CG) copyout(U,s11,s21,s31,s12,s22,s32,s13,s23,s33)
170	!$accx copyin(NG3,NG2,N1,N2,PI,B,G2MAX,TINY)
171	!$acc kernels
172	!$acc loop independent
173	      do J3 = J3min,J3max
174	!$acc loop independent
175	        do J2 = J2min,J2max
176	!$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
177	!$accx private(G)
178	          do J1 = 0,N1-1
179	            if (J2.gt.NG2/2) then
180	              I2 = J2 - NG2
181	            else
182	              I2 = J2
183	            endif
184
185	            if (J3.gt.NG3/2) then
186	              I3 = J3 - NG3
187	            else
188	              I3 = J3
189	            endif
190
191	            if (J1.gt.N1/2) then
192	              I1 = J1 - N1
193	            else
194	              I1 = J1
195	            endif
``````

It compiles with:

``````
169, Generating copyout(s33)
Generating copyout(s23)
Generating copyout(s13)
Generating copyout(s32)
Generating copyout(s22)
Generating copyout(s12)
Generating copyout(s31)
Generating copyout(s21)
Generating copyout(s11)
Generating copyout(u)
Generating copyin(tiny)
Generating copyin(g2max)
Generating copyin(b(:,:))
Generating copyin(pi)
Generating copyin(n2)
Generating copyin(n1)
Generating copyin(ng2)
Generating copyin(ng3)
Generating copy(cg(:,:))
171, Generating present_or_copy(cg(:,:))
Generating present_or_copyin(b(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
173, Loop is parallelizable
175, Loop is parallelizable
Accelerator kernel generated
173, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
175, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 1.3 : 46 registers; 264 shared, 40 constant, 0 local memory bytes
CC 2.0 : 62 registers; 0 shared, 296 constant, 0 local memory bytes
178, Loop carried reuse of 'cg' prevents parallelization
Complex loop carried dependence of 'cg' prevents parallelization
Inner sequential loop scheduled on accelerator
``````

But I get a run-time error:

``````
0: ALLOCATE: 18446744071562067970 bytes requested; not enough memory
make: *** [completed_work] Error 127
``````

Most likely a bogus value is being used when allocating device data. Double-check that the arrays are all allocated on the host before being passed to the device.
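For what it’s worth, 18446744071562067970 equals 2^64 − 2147483646, which is how a negative size of −2147483646 bytes looks when printed as an unsigned 64-bit number. That would be consistent with an unallocated array or an uninitialized extent. A minimal host-side check, assuming a rank-2 double-precision allocatable like CG (the module and function names here are hypothetical), might look like this:

```fortran
module device_alloc_check
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! .true. only if the array is allocated and has at least one element.
  logical function ready_for_device(a) result(ok)
    real(dp), allocatable, intent(in) :: a(:,:)
    ok = .false.
    if (allocated(a)) ok = size(a, kind=int64) > 0_int64
  end function ready_for_device

  ! Number of bytes a copy of the array would need (0 if unallocated).
  integer(int64) function bytes_needed(a) result(nbytes)
    real(dp), allocatable, intent(in) :: a(:,:)
    nbytes = 0_int64
    if (allocated(a)) then
      ! storage_size returns bits per element
      nbytes = size(a, kind=int64) * int(storage_size(a) / 8, int64)
    end if
  end function bytes_needed

end module device_alloc_check
```

Calling `ready_for_device(CG)` and printing `bytes_needed(CG)` just before the data region shows whether the host-side array is in a sane state before anything is handed to the device.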

If that’s not it, I’d need to see the full source to determine whether it’s an error in your program or a compiler error. Can you please post a reproducing example or send one to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat