How to avoid this dependency

!!$acc kernels do private(i) ! not allowed due to loop dependency
! do i = 1, nn
!   sz(i) = D(i)*( s(i) - A(i,1) * sz(i+JA1) &
!                       - A(i,2) * sz(i+JA2) &
!                       - A(i,3) * sz(i+JA3) )
! end do
!!$acc end kernels

The code above cannot be accelerated because of the loop-carried dependency on sz.

I tried to solve that by preparing a copy of sz called Fsz, like this:
double precision,dimension(n1:n2) :: sz
double precision,dimension(n1:n2) :: Fsz

do i = n1, n2
Fsz(i) = sz(i)
end do
and changed the code as below:
!$acc kernels do private(i)
do i = 1, nn
  sz(i) = D(i)*( s(i) - A(i,1) * Fsz(i+JA1) &
                      - A(i,2) * Fsz(i+JA2) &
                      - A(i,3) * Fsz(i+JA3) )
end do
!$acc end kernels

Is my idea wrong? In any case, I cannot get the right result.
(I have already sent the code to Mat and got very useful advice.)

It seems I made a mistake there.
In the code below,
do i = 1, nn
  sz(i) = D(i)*( s(i) - A(i,1) * sz(i+JA1) &
                      - A(i,2) * sz(i+JA2) &
                      - A(i,3) * sz(i+JA3) )
end do

each iteration reads the sz values that earlier iterations of the loop have already updated. So it is actually not equivalent to
do i = 1, nn
  sz(i) = D(i)*( s(i) - A(i,1) * Fsz(i+JA1) &
                      - A(i,2) * Fsz(i+JA2) &
                      - A(i,3) * Fsz(i+JA3) )
end do

So I was wrong. But isn't there any solution?

These two loops behave differently:
Do I = 2, n
  X(i) = X(i) + X(i-1)
Enddo
and
Do I = 2, n
  X(i) = X(i) + X(i+1)
Enddo
I suppose I have to give up, since the JA2 in sy(i+JA2) is -1. If it were -10 or smaller, I might be able to find a solution.

Hi Kevin,

Your strategy of using Fsz is correct for forward dependencies, since the value of sz(i+N) is fixed relative to the value of sz(i). However, your JA values are negative, resulting in a backward dependency, so there's not much that can be done except run the loop sequentially.

- Mat
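The difference between the two kinds of dependency can be checked with a small sketch. The module and routine names below (dep_demo, update_in_place, update_from_copy) are made up for illustration and are not from Kevin's code. With a positive offset the snapshot version reproduces the sequential loop; with a negative offset it does not, because the sequential loop reads values it has already updated.

```fortran
module dep_demo
  implicit none
contains
  ! Sequential in-place update: x(i) = x(i) + x(i+off) for i = lo..hi.
  subroutine update_in_place(x, lo, hi, off)
    double precision, intent(inout) :: x(:)
    integer, intent(in) :: lo, hi, off
    integer :: i
    do i = lo, hi
      x(i) = x(i) + x(i+off)
    end do
  end subroutine update_in_place

  ! Same formula, but every read comes from a snapshot taken before the
  ! loop (the role Fsz plays in the thread). Iterations are independent,
  ! so the loop can be offloaded.
  subroutine update_from_copy(x, lo, hi, off)
    double precision, intent(inout) :: x(:)
    integer, intent(in) :: lo, hi, off
    double precision :: xold(size(x))
    integer :: i
    xold = x
    !$acc kernels loop independent copyin(xold) copy(x)
    do i = lo, hi
      x(i) = xold(i) + xold(i+off)
    end do
  end subroutine update_from_copy
end module dep_demo
```

Calling both routines on the same input with off = +1 gives identical arrays, while off = -1 turns the in-place loop into a running sum that the snapshot version cannot reproduce.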

Hi, everyone.
To get rid of the data dependency above, I tried using the multigrid method instead of ILU. But I have run into problems that I cannot solve right now.
Part of the code looks like this:
!$acc kernels
do i = 1, NI*NJ
r(i) = c0
r0(i) = c0
p(i) = c0
yy(i) = c0
e(i) = c0
v(i) = c0
end do
c
c do i = 1, nn
c r(i) = B(i) - A(i,1) * X(i+JA1) - A(i,2) * X(i+JA2) &
c - A(i,3) * X(i) &
c - A(i,4) * X(i+JA4) - A(i,5) * X(i+JA5)
c end do
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
!$acc do private(r)
DO IJ=II+2,II+NJM
      r(IJ)=QA(IJ)-APA(IJ)*FIA(IJ)-AEA(IJ)*FIA(IJ+NJ)-
     *      AWA(IJ)*FIA(IJ-NJ)-ASA(IJ)*FIA(IJ-1)-ANA(IJ)*FIA(IJ+1)
END DO
END DO
c
c c1 = c0
c do i = 1, nn
c c1 = c1 + r(i) * r(i)
c end do

DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
c1 = c1 + r(IJ) * r(IJ)
END DO
END DO

c bb = c0
c do i = 1, nn
c bb = bb + B(i) * B(i)
c end do
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
bb = bb + QA(IJ) * QA(IJ)
END DO
END DO

DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
!$acc do private(p, r0)
DO IJ=II+2,II+NJM
p(IJ) = r(IJ)
r0(IJ) = r(IJ)
END DO
END DO
!$acc end kernels

The compiler output shows nothing abnormal, although I don't understand why I am supposed to add "do private" in some places and not in others.

bicgstabmg:
2085, Generating present_or_copyin(ijgr(l))
Generating present_or_copyin(ana(:))
Generating present_or_copyin(asa(:))
Generating present_or_copyin(awa(:))
Generating present_or_copyin(aea(:))
Generating present_or_copyin(apa(:))
Generating present_or_copyin(fia(:))
Generating present_or_copyin(qa(:))
Generating copyin(r(:))
Generating copyout(r(:ni*nj))
Generating present_or_copyout(r0(:ni*nj))
Generating present_or_copyout(p(:ni*nj))
Generating present_or_copyout(yy(:ni*nj))
Generating present_or_copyout(e(:ni*nj))
Generating present_or_copyout(v(:ni*nj))
Generating compute capability 2.0 binary
2086, Loop is parallelizable
Accelerator kernel generated
2086, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 2.0 : 12 registers; 0 shared, 108 constant, 0 local memory bytes
2100, Loop is parallelizable
Accelerator kernel generated
2100, !$acc loop gang ! blockidx%x
CC 2.0 : 24 registers; 16 shared, 144 constant, 0 local memory bytes
2103, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
2114, Loop is parallelizable
Accelerator kernel generated
2114, !$acc loop gang ! blockidx%x
CC 2.0 : 16 registers; 16 shared, 92 constant, 0 local memory bytes
2116, !$acc loop vector(128) ! threadidx%x
2117, Sum reduction generated for c1
2116, Loop is parallelizable
2125, Loop is parallelizable
Accelerator kernel generated
2125, !$acc loop gang ! blockidx%x
CC 2.0 : 16 registers; 16 shared, 92 constant, 0 local memory bytes
2127, !$acc loop vector(128) ! threadidx%x
2128, Sum reduction generated for bb
2127, Loop is parallelizable
2132, Loop is parallelizable
Accelerator kernel generated
2132, !$acc loop gang ! blockidx%x
CC 2.0 : 18 registers; 16 shared, 108 constant, 0 local memory bytes
2135, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable

I am ashamed to admit that after a lot of practice I am still an amateur.

Hi Kevin,

You need to be careful when using the “private” clause. When you privatize an object, its lifetime is the same as that of the kernel in which it is used. Hence, by putting “r”, “r0” and “p” in a private clause, their values are thrown away at the end of the loop. This is what’s causing your wrong answers, so you need to remove these clauses.

Also, only rectangular loops can be accelerated, hence the inner IJ loops won’t accelerate. The outer I loops should be ok, but you may need to add an “!$acc loop independent” directive above them. Because the inner loop bounds are computed, the compiler can’t tell that the array updates in the IJ loop don’t overlap across iterations of I.

Hope this helps,
Mat
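As a sketch of that advice, the copy loop from the question could be written without any private clause on the arrays and with "loop independent" on the outer loop. The subroutine name copy_interior and the argument ioff (standing in for IJGR(L)) are assumptions made for this example, as are the data clauses; the real code may already have the arrays on the device.

```fortran
module residual_copy
  implicit none
contains
  ! Copy the interior entries of r into p and r0.
  ! ioff stands in for IJGR(L) from the thread (hypothetical argument).
  subroutine copy_interior(r, p, r0, nim, nj, njm, ioff)
    double precision, intent(in)    :: r(:)
    double precision, intent(inout) :: p(:), r0(:)
    integer, intent(in) :: nim, nj, njm, ioff
    integer :: i, ij, ii
    ! No private clause on r, p or r0: their values must survive the loop.
    !$acc kernels copyin(r) copy(p, r0)
    !$acc loop independent
    do i = 2, nim
      ii = (i-1)*nj + ioff
      do ij = ii+2, ii+njm
        p(ij)  = r(ij)
        r0(ij) = r(ij)
      end do
    end do
    !$acc end kernels
  end subroutine copy_interior
end module residual_copy
```

The "independent" assertion is only safe because the index ranges ii+2..ii+njm of different i values do not overlap, which holds as long as njm does not exceed nj.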

Hi, Mat

Thank you very much. That really helps me.

However, for the part below
!$acc loop independent
DO I=2,NIM
  II=(I-1)*NJ+IJGR(L)
  DO IJ=II+2,II+NJM
    p(IJ) = r(IJ)
    r0(IJ) = r(IJ)
  END DO
END DO
the compiler still told me:
Loop carried reuse of ‘p’ prevents parallelization
Loop carried reuse of ‘r0’ prevents parallelization

Any idea how to solve it?

I have found something interesting.
The code that I asked about before is
DO I=2,NIM
  II=(I-1)*NJ+IJGR(L)
  DO IJ=II+2,II+NJM
    p(IJ) = r(IJ)
    r0(IJ) = r(IJ)
  END DO
END DO

However, if I merge the two DO loops into one, the compiler no longer complains and the code works correctly, without even needing "loop independent". To be clear, I just changed it to
DO IJ=NJ+IJGR(L)+2, (NIM-1)*NJ+IJGR(L)+NJM
  p(IJ) = r(IJ)
  r0(IJ) = r(IJ)
END DO
So I admit it was my fault for writing it that way; the compiler is not a god who knows everything.
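One thing to watch with the merged loop: it runs over one contiguous index range, so it also writes the entries between the interior ranges of neighbouring I values, which the nested version skipped. That is harmless for a plain copy but may matter for other loops. A minimal sketch of a single loop that still skips those entries is given below; the routine name copy_interior_flat and the argument ioff (standing in for IJGR(L)) are hypothetical, and the index recovery assumes the layout IJ = (I-1)*NJ + IJGR(L) + J used in the thread.

```fortran
module flat_copy
  implicit none
contains
  ! Single flat loop over the same index range as the merged loop, with
  ! the position j inside each group of nj entries recovered from ij so
  ! that only j = 2..njm is touched, as in the nested version.
  subroutine copy_interior_flat(r, p, r0, nim, nj, njm, ioff)
    double precision, intent(in)    :: r(:)
    double precision, intent(inout) :: p(:), r0(:)
    integer, intent(in) :: nim, nj, njm, ioff
    integer :: ij, j
    !$acc kernels loop independent copyin(r) copy(p, r0)
    do ij = nj+ioff+2, (nim-1)*nj+ioff+njm
      j = mod(ij-ioff-1, nj) + 1
      if (j >= 2 .and. j <= njm) then
        p(ij)  = r(ij)
        r0(ij) = r(ij)
      end if
    end do
  end subroutine copy_interior_flat
end module flat_copy
```

The loop bounds are now plain scalars, so the loop is rectangular, at the cost of one integer mod and a branch per element.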

Well, the interesting part is not over. In the rest of my code, statements of the form
A(IJ) = A(IJ) + somethingelse
or
A(IJ) = B(IJ) + somethingelse
make the compiler complain, while the whole-array forms below compile without any problem:
A = A + somethingelse
or
A = B + somethingelse
I think it is fun, but I just don't know why.

I would like to change the topic now.
I am a little puzzled about how large a region the !$acc data construct should cover. Is wider better? In other words, should we keep data resident on the GPU, even something as small as a scalar constant (not an array), as long as it is used frequently?
Any advice would be appreciated.
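For reference, a common pattern is to open one data region around the whole iteration loop so that arrays are transferred once instead of once per kernel. The sketch below is only an illustration of that pattern: the routine name relax and the update formula are invented for this example and are not part of Kevin's solver. Scalars such as w are usually handled by the compiler without an explicit data clause, though that is an assumption worth checking against the compiler feedback.

```fortran
module data_region_demo
  implicit none
contains
  ! Repeat a simple pointwise update niter times.
  ! The arrays stay on the device for the whole iteration loop.
  subroutine relax(x, b, d, w, niter)
    double precision, intent(inout) :: x(:)
    double precision, intent(in)    :: b(:), d(:), w
    integer, intent(in) :: niter
    integer :: it, i, n
    n = size(x)
    !$acc data copyin(b, d) copy(x)
    do it = 1, niter
      ! present(...) states that no transfer is expected at this point.
      !$acc kernels loop independent present(x, b, d)
      do i = 1, n
        x(i) = x(i) + w*(b(i) - d(i)*x(i))
      end do
    end do
    !$acc end data
  end subroutine relax
end module data_region_demo
```

With this layout, b and d are copied in once, and x is copied in at the start and out at the end, regardless of niter.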

Although I succeeded in merging the two DO loops into one, I cannot rely on luck to solve everything, because I ran into this loop nest:
NIC=NIGR(L)
NJC=NJGR(L)
DO IC=2,NIC-1
  IIC=(IC-1)*NJC+IJGR(L)
  IF=2*IC-2
  IIF=(IF-1)*NJF+IJGR(L+1)
  DO JC=2,NJC-1
    IJC=IIC+JC
    JF=2*JC-2
    IJF=IIF+JF
    QA(IJC)=RES(IJF)+RES(IJF+1)+RES(IJF+NJF)+RES(IJF+NJF+1)
    FIA(IJC)=0.
  END DO
END DO
I have tried many ways to avoid the strange behaviour described above, but I still could not get a version that the compiler accepts. Of course, if I leave the loop as it is and just add the directives, it does not work either. Any idea about that?
Also, I was disappointed by the run time. I would like the data to stay on the GPU, because otherwise the transfers cost too much. It is not a good idea to mix CPU code with GPU code, is it? But what can you do when you don't know how to write ACC directives for all of it?
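One way the restriction loop above might be restructured is to compute every index inside the innermost loop, so that both loops have fixed bounds and form a rectangular nest. This is a sketch under assumptions, not a tested fix for the original code: the routine name restrict_to_coarse is invented, ioffc and iofff stand in for IJGR(L) and IJGR(L+1), ifn replaces the variable IF to avoid confusion with the keyword, and "loop independent" is only valid if QA, FIA and RES do not overlap in memory.

```fortran
module restrict_demo
  implicit none
contains
  ! Sum each 2x2 block of the fine-grid array res into the coarse-grid
  ! array qa and zero the coarse-grid array fia.
  subroutine restrict_to_coarse(qa, fia, res, nic, njc, njf, ioffc, iofff)
    double precision, intent(inout) :: qa(:), fia(:)
    double precision, intent(in)    :: res(:)
    integer, intent(in) :: nic, njc, njf, ioffc, iofff
    integer :: ic, jc, ifn, jf, ijc, ijf
    !$acc kernels copyin(res) copy(qa, fia)
    !$acc loop independent
    do ic = 2, nic-1
      !$acc loop independent
      do jc = 2, njc-1
        ifn = 2*ic - 2                    ! fine-grid I index
        jf  = 2*jc - 2                    ! fine-grid J index
        ijc = (ic-1)*njc + ioffc + jc     ! coarse-grid position
        ijf = (ifn-1)*njf + iofff + jf    ! fine-grid position
        qa(ijc)  = res(ijf) + res(ijf+1) + res(ijf+njf) + res(ijf+njf+1)
        fia(ijc) = 0.0d0
      end do
    end do
    !$acc end kernels
  end subroutine restrict_to_coarse
end module restrict_demo
```

Each (ic, jc) pair writes a distinct ijc, which is what the "independent" assertion relies on.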

Hi Kevin,

I’m attending SC12 right now, and this question deserves more than the few minutes I have at the moment. Once I’m back in the office, I’ll take a closer look and see what we can do.

Please send me the updated version of your code, since this will help me understand where you’re at.

- Mat

Dear Mat,

Thank you. I have already mailed you the updated code. There is no hurry, and I appreciate your help.

Hi Kevin,

The basic problem is that your problem size is so small that the overhead of launching kernels and performing the data look-up on the device is dominating the run time. Can you increase your problem size?

- Mat