how to avoid this dependency

KevinWoo · November 2, 2012, 1:40pm

!!acc kernels do private(i) !not allowed due to loop dependency
! do i = 1, nn
! sz(i) = D(i)*( s(i) - A(i,1) * sz(i+JA1) &
! - A(i,2) * sz(i+JA2) &
! - A(i,3) * sz(i+JA3) )
! end do
!!acc end kernels

The code above cannot be accelerated because of its dependency on sz.

I tried to solve that by preparing a copy of sz called Fsz, like this:
double precision,dimension(n1:n2) :: sz
double precision,dimension(n1:n2) :: Fsz
…
do i = n1, n2
Fsz(i) = sz(i)
end do
and changed the code as below:
!$acc kernels do private(i)
do i = 1, nn
sz(i) = D(i)*( s(i) - A(i,1) * Fsz(i+JA1) &
- A(i,2) * Fsz(i+JA2) &
- A(i,3) * Fsz(i+JA3) )
end do
!$acc end kernels

Is my idea wrong? Either way, I cannot get the right result.
(I have already sent the code to Mat and got very useful advice.)

KevinWoo · November 3, 2012, 1:14am

It seems like I made a mistake there.
For the code below,
do i = 1, nn
sz(i) = D(i)*( s(i) - A(i,1) * sz(i+JA1) &
- A(i,2) * sz(i+JA2) &
- A(i,3) * sz(i+JA3) )
end do

each iteration uses the sz values that earlier iterations of the loop have already updated. So it is actually not equal to
do i = 1, nn
sz(i) = D(i)*( s(i) - A(i,1) * Fsz(i+JA1) &
- A(i,2) * Fsz(i+JA2) &
- A(i,3) * Fsz(i+JA3) )
end do

So I was wrong. But is there really no solution?

KevinWoo · November 3, 2012, 5:46am

Do I = 2, n
X(i) = X(i) + X(i-1)
Enddo
and
Do I = 2, n
X(i) = X(i) + X(i+1)
Enddo
are different.
I suppose I have to give up, since the JA2 in sy(i+JA2) is -1. If it were -10 or any smaller number, I might be able to find a solution.

MatColgrove · November 5, 2012, 11:17pm

Hi Kevin,

Your strategy of using Fsz is correct for forward dependencies, since the value of sz(i+N) is fixed relative to the value of sz(i). However, your JA's are negative, resulting in a backward dependency, so there's not much that can be done except run the loop sequentially.
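The difference between the two cases can be checked on the CPU with a small test program. This is only a sketch: the array names and the size `n` are made up for illustration, and the simple `x(i) + x(i+1)` update stands in for the real sz loop.

```fortran
program dependency_demo
  implicit none
  integer, parameter :: n = 8            ! hypothetical size, for illustration only
  double precision :: x(n), xcopy(n)
  double precision :: fwd_seq(n), fwd_par(n), bwd_seq(n), bwd_par(n)
  integer :: i

  x = [(dble(i), i = 1, n)]
  xcopy = x                              ! plays the role of Fsz

  ! Forward dependency, in place: x(i+1) has not been overwritten yet
  ! when iteration i reads it.
  fwd_seq = x
  do i = 1, n - 1
     fwd_seq(i) = fwd_seq(i) + fwd_seq(i+1)
  end do

  ! Same loop reading from the copy: iterations are independent.
  fwd_par = x
  !$acc kernels loop independent
  do i = 1, n - 1
     fwd_par(i) = xcopy(i) + xcopy(i+1)
  end do

  ! Backward dependency, in place: x(i-1) was updated by the previous iteration.
  bwd_seq = x
  do i = 2, n
     bwd_seq(i) = bwd_seq(i) + bwd_seq(i-1)
  end do

  ! Reading from the copy sees the OLD x(i-1), so the result changes.
  bwd_par = x
  do i = 2, n
     bwd_par(i) = xcopy(i) + xcopy(i-1)
  end do

  print *, 'forward  versions match: ', all(fwd_seq == fwd_par)
  print *, 'backward versions match: ', all(bwd_seq == bwd_par)
end program dependency_demo
```

The forward pair matches and the backward pair does not, which is why the copy trick cannot rescue a loop whose offsets are negative.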

Mat

KevinWoo · November 8, 2012, 2:35pm

Hi, everyone.
To get rid of the data dependency above, I tried to use the multigrid method instead of ILU. But I still have problems which I cannot solve right now.
Part of the code is like this:
!$acc kernels
do i = 1, NI*NJ
r(i) = c0
r0(i) = c0
p(i) = c0
yy(i) = c0
e(i) = c0
v(i) = c0
end do
c
c do i = 1, nn
c r(i) = B(i) - A(i,1) * X(i+JA1) - A(i,2) * X(i+JA2) &
c - A(i,3) * X(i) &
c - A(i,4) * X(i+JA4) - A(i,5) * X(i+JA5)
c end do
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
!$acc do private(r)
DO IJ=II+2,II+NJM
r(IJ)=QA(IJ)-APA(IJ)*FIA(IJ)-AEA(IJ)*FIA(IJ+NJ)-
     & AWA(IJ)*FIA(IJ-NJ)-ASA(IJ)*FIA(IJ-1)-ANA(IJ)*FIA(IJ+1)
END DO
END DO
c
c c1 = c0
c do i = 1, nn
c c1 = c1 + r(i) * r(i)
c end do

DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
c1 = c1 + r(IJ) * r(IJ)
END DO
END DO

c bb = c0
c do i = 1, nn
c bb = bb + B(i) * B(i)
c end do
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
bb = bb + QA(IJ) * QA(IJ)
END DO
END DO

DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
!$acc do private(p, r0)
DO IJ=II+2,II+NJM
p(IJ) = r(IJ)
r0(IJ) = r(IJ)
END DO
END DO
!$acc end kernels

The compiler messages show nothing abnormal, although I don't know why I am supposed to add "do private" in some places and not in others.

bicgstabmg:
2085, Generating present_or_copyin(ijgr(l))
Generating present_or_copyin(ana(:))
Generating present_or_copyin(asa(:))
Generating present_or_copyin(awa(:))
Generating present_or_copyin(aea(:))
Generating present_or_copyin(apa(:))
Generating present_or_copyin(fia(:))
Generating present_or_copyin(qa(:))
Generating copyin(r(:))
Generating copyout(r(:ninj))
Generating present_or_copyout(r0(:ninj))
Generating present_or_copyout(p(:ninj))
Generating present_or_copyout(yy(:ninj))
Generating present_or_copyout(e(:ninj))
Generating present_or_copyout(v(:ninj))
Generating compute capability 2.0 binary
2086, Loop is parallelizable
Accelerator kernel generated
2086, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 2.0 : 12 registers; 0 shared, 108 constant, 0 local memory bytes
2100, Loop is parallelizable
Accelerator kernel generated
2100, !$acc loop gang ! blockidx%x
CC 2.0 : 24 registers; 16 shared, 144 constant, 0 local memory bytes
2103, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
2114, Loop is parallelizable
Accelerator kernel generated
2114, !$acc loop gang ! blockidx%x
CC 2.0 : 16 registers; 16 shared, 92 constant, 0 local memory bytes
2116, !$acc loop vector(128) ! threadidx%x
2117, Sum reduction generated for c1
2116, Loop is parallelizable
2125, Loop is parallelizable
Accelerator kernel generated
2125, !$acc loop gang ! blockidx%x
CC 2.0 : 16 registers; 16 shared, 92 constant, 0 local memory bytes
2127, !$acc loop vector(128) ! threadidx%x
2128, Sum reduction generated for bb
2127, Loop is parallelizable
2132, Loop is parallelizable
Accelerator kernel generated
2132, !$acc loop gang ! blockidx%x
CC 2.0 : 18 registers; 16 shared, 108 constant, 0 local memory bytes
2135, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable

I am ashamed to admit that after a lot of practice I am still an amateur.

MatColgrove · November 8, 2012, 9:08pm

Hi Kevin,

You need to be careful when using the “private” clause. When you privatize an object, its lifetime is the same as the kernel in which it was used. Hence, by putting “r”, “r0” and “p” in a private clause, the values are thrown away at the end of the loop. This is what's causing your wrong answers, so you need to remove these clauses.
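As a rough illustration of where a private clause does and does not belong, here is a minimal sketch. The scratch scalar `tmp` and the array size are hypothetical; only the names `r`, `p` and `r0` come from the code above.

```fortran
program private_demo
  implicit none
  integer, parameter :: n = 16           ! hypothetical size
  double precision :: r(n), p(n), r0(n)
  double precision :: tmp                ! hypothetical per-iteration scratch value
  integer :: i

  r = [(dble(i), i = 1, n)]
  p = 0.0d0
  r0 = 0.0d0

  ! tmp is only scratch space inside one iteration, so privatizing it is safe.
  ! p and r0 hold results that are needed after the loop, so they must NOT
  ! appear in a private clause.
  !$acc kernels loop private(tmp)
  do i = 1, n
     tmp = 2.0d0 * r(i)
     p(i) = tmp
     r0(i) = r(i)
  end do

  print *, 'sum(p)  =', sum(p)
  print *, 'sum(r0) =', sum(r0)
end program dependency_free_placeholder
```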

Also, only rectangular loops can be accelerated, so the inner IJ loops won't accelerate. The outer I loops should be OK, but you may need to add an “!$acc loop independent” directive above them. Because of the computed inner loop bounds, the compiler can't tell that array updates in the IJ loop don't overlap across iterations of I.
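A sketch of that suggestion, with the private clauses on the arrays removed and the outer loop marked independent, could look like the following. The grid sizes are made up and `ioff` stands in for IJGR(L); whether a particular compiler version then parallelizes the loop is not guaranteed by this example.

```fortran
program independent_demo
  implicit none
  integer, parameter :: ni = 6, nj = 5               ! hypothetical grid size
  integer, parameter :: nim = ni - 1, njm = nj - 1
  integer, parameter :: ioff = 0                     ! stands in for IJGR(L)
  double precision :: r(ni*nj), p(ni*nj), r0(ni*nj)
  integer :: i, ij, ii

  r = [(dble(i), i = 1, ni*nj)]
  p = 0.0d0
  r0 = 0.0d0

  !$acc kernels
  ! The programmer asserts that different values of i touch different
  ! elements of p and r0; the compiler cannot prove it from the bounds.
  !$acc loop independent private(ii)
  do i = 2, nim
     ii = (i - 1)*nj + ioff
     do ij = ii + 2, ii + njm
        p(ij) = r(ij)
        r0(ij) = r(ij)
     end do
  end do
  !$acc end kernels

  print *, 'interior points copied:', count(p /= 0.0d0)
end program independent_demo
```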

Hope this helps,
Mat

KevinWoo · November 9, 2012, 1:05am

Hi, Mat

Thank you very much. That really helps me.

However, the part below
!$acc loop independent
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
p(IJ) = r(IJ)
r0(IJ) = r(IJ)
END DO
END DO
still gives me the messages
Loop carried reuse of ‘p’ prevents parallelization
Loop carried reuse of ‘r0’ prevents parallelization

Any idea how to solve it?

KevinWoo · November 9, 2012, 3:03pm

I have found something interesting.
The code that I asked about before is
DO I=2,NIM
II=(I-1)*NJ+IJGR(L)
DO IJ=II+2,II+NJM
p(IJ) = r(IJ)
r0(IJ) = r(IJ)
END DO
END DO

However, if I change the two DO loops into one, the compiler no longer complains and the code works correctly, with no need to add "loop independent". To be clear, I just changed it to
Do IJ=NJ+IJGR(L)+2, (NIM-1)*NJ+IJGR(L)+NJM
p(IJ) = r(IJ)
r0(IJ) = r(IJ)
END DO
So I admit it was my fault to write it that way; the compiler is not a god who knows everything.
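One thing worth checking with the single-loop version is that it covers a contiguous index range, so it also touches the boundary entries that the nested loops skip. The sketch below counts the entries written by each form; the grid sizes are made up and `ioff` stands in for IJGR(L). For a plain copy such as p = r this is harmless only if r already holds sensible values at those boundary entries.

```fortran
program flatten_demo
  implicit none
  integer, parameter :: ni = 6, nj = 5               ! hypothetical grid size
  integer, parameter :: nim = ni - 1, njm = nj - 1
  integer, parameter :: ioff = 0                     ! stands in for IJGR(L)
  double precision :: r(ni*nj), p_nested(ni*nj), p_flat(ni*nj)
  integer :: i, ij, ii

  r = 1.0d0
  p_nested = 0.0d0
  p_flat = 0.0d0

  ! Original nested form: interior points only.
  do i = 2, nim
     ii = (i - 1)*nj + ioff
     do ij = ii + 2, ii + njm
        p_nested(ij) = r(ij)
     end do
  end do

  ! Single-loop form: one contiguous range from the first interior point
  ! to the last, which includes the boundary entries in between.
  !$acc kernels loop
  do ij = nj + ioff + 2, (nim - 1)*nj + ioff + njm
     p_flat(ij) = r(ij)
  end do

  print *, 'nested loops wrote', count(p_nested /= 0.0d0), 'entries'
  print *, 'single loop wrote ', count(p_flat /= 0.0d0), 'entries'
end program flatten_demo
```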

Well, the interesting part is not over. In the rest of my code, statements of the form
A(IJ) = A(IJ) +somethingelse
or
A(IJ) = B(IJ) +somethingelse
make the compiler complain, while the forms below give no problem:
A = A +somethingelse
or
A = B +somethingelse
I think it is funny, but I just don't know why.

I would like to change the topic now.
I am a little puzzled about how large a region the !$acc data directive should cover. The wider the better? In other words, should we keep data on the GPU, even something as small as a constant (not an array), as long as it is used frequently?
Any advice would be appreciated.
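For reference, a data region that spans several kernels might be sketched as below. The arrays, sizes and iteration count are hypothetical; the point is only that x and y are transferred once for the whole region instead of once per kernel.

```fortran
program data_region_demo
  implicit none
  integer, parameter :: n = 1000         ! hypothetical problem size
  integer, parameter :: niter = 10       ! hypothetical iteration count
  double precision :: x(n), y(n), total
  integer :: i, it

  x = 1.0d0
  y = 0.0d0
  total = 0.0d0

  ! x is copied to the device once; y is copied in once and back once.
  !$acc data copyin(x) copy(y)
  do it = 1, niter
     ! Each pass launches a kernel but moves no array data.
     !$acc kernels loop
     do i = 1, n
        y(i) = y(i) + x(i)
     end do
  end do

  ! Only the scalar result of the reduction returns to the host here.
  !$acc kernels loop reduction(+:total)
  do i = 1, n
     total = total + y(i)
  end do
  !$acc end data

  print *, 'total =', total
end program data_region_demo
```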

KevinWoo · November 12, 2012, 3:58pm

Although changing two DO loops into one worked, I cannot rely on luck to solve everything, because I ran into these DO loops:
NIC=NIGR(L)
NJC=NJGR(L)
DO IC=2,NIC-1
IIC=(IC-1)*NJC+IJGR(L)
IF=2*IC-2
IIF=(IF-1)*NJF+IJGR(L+1)
DO JC=2,NJC-1
IJC=IIC+JC
JF=2*JC-2
IJF=IIF+JF
QA(IJC)=RES(IJF)+RES(IJF+1)+RES(IJF+NJF)+RES(IJF+NJF+1)
FIA(IJC)=0.
END DO
END DO
I have tried many ways to work around the strange behavior described above, but still could not get a version that the compiler accepts. Of course, if I leave the loops as they are and just add the directives, it doesn't work either. Any idea about that?
Also, I am disappointed by the run time. I wish the data would stay on the GPU; otherwise the transfers cost too much. It is not a good idea to mix CPU code with GPU code, is it? But what can you do when you don't know how to write ACC directives for all of it?
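One possible way to write this restriction loop, offered only as a sketch, is to compute the indices inside the inner loop so that both loops have fixed bounds, and to assert independence explicitly. The grid sizes are made up, `offc` and `offf` stand in for IJGR(L) and IJGR(L+1), and the relation between fine and coarse sizes is an assumption chosen so that the indices stay in range.

```fortran
program restrict_demo
  implicit none
  integer, parameter :: nic = 4, njc = 4                  ! hypothetical coarse grid
  integer, parameter :: nif = 2*nic - 2, njf = 2*njc - 2  ! assumed fine grid size
  integer, parameter :: offc = 0, offf = 0   ! stand in for IJGR(L), IJGR(L+1)
  double precision :: res(nif*njf), qa(nic*njc), fia(nic*njc)
  integer :: ic, jc, ijc, ijf

  res = 1.0d0
  qa = 0.0d0
  fia = 1.0d0

  !$acc kernels
  !$acc loop independent
  do ic = 2, nic - 1
     !$acc loop independent private(ijc, ijf)
     do jc = 2, njc - 1
        ! Same index arithmetic as the original, folded into one place:
        ! IJC = (IC-1)*NJC + IJGR(L) + JC
        ! IJF = (2*IC-3)*NJF + IJGR(L+1) + 2*JC-2
        ijc = (ic - 1)*njc + offc + jc
        ijf = (2*ic - 3)*njf + offf + 2*jc - 2
        qa(ijc) = res(ijf) + res(ijf+1) + res(ijf+njf) + res(ijf+njf+1)
        fia(ijc) = 0.0d0
     end do
  end do
  !$acc end kernels

  print *, 'sum(qa) =', sum(qa)
end program restrict_demo
```

Each (ic, jc) pair writes a distinct ijc, which is what justifies the independent assertion here.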

MatColgrove · November 14, 2012, 3:43pm

Hi Kevin,

I’m attending SC12 right now and this question deserves more than the few minutes I have. Once I’m back in the office, I’ll take a closer look and see what we can do.

Please send me the updated version of your code, since this will help me understand where you're at.

Mat

KevinWoo · November 16, 2012, 12:24pm

Dear Mat,

Thank you. I have already mailed you the updated code. No hurry, and I appreciate your help.

MatColgrove · November 16, 2012, 11:11pm

Hi Kevin,

The basic problem is that your problem size is so small that the overhead of launching kernels and performing the data look-up on the device is dominating the run time. Can you increase your problem size?

Mat
