How to parallelize a forward/backward dependency loop

The sample program has forward/backward dependency.
How can we parallelize it?

Thanks,

Wei


program tst
implicit none
real, dimension(0:100, 100) :: a
real, dimension(101, 100) :: d
real, dimension(100) :: b
real :: c
integer :: i, j

c = 1.5

!$acc kernels loop
do j = 1, 100
b(j) = 0.01 * real(j)
a(0, j) = 1.0
!$acc loop vector
do i = 1, 100
a(i, j) = a(i-1, j) * b(j) + c
enddo
enddo
!$acc end kernels

!$acc kernels loop
do j = 1, 100
d(101, j) = 2.0
!$acc loop vector
do i = 100, 1, -1
d(i, j) = d(i+1, j) * b(j) + c
enddo
enddo
!$acc end kernels

print *, "a(100, 100) = ", a(100, 100)
print *, "d(1, 100) = ", d(1, 100)

end program tst

Hi Wei,

Both of these are backward dependencies in that the assignment of the current iteration of the loop depends on a value computed in a previous iterations. While in the second loop you have “i+1”, the loop iterates in descending order so “i+1” iteration occurs before “i”.

For backward dependencies, there isn’t a way to parallelize them. The current iteration’s assignment depends on a value computed in a previous iteration. For these loops you’ll only be able to parallelize the outer “j”.

For forward dependencies, one way is to copy the values of the array to a temp array and use the temp array on the right hand side. “a(i,j) = aOld(i+1,j)”.

Hope this helps,
Mat

Mat,

Sorry, I did not get it.

Can you show me a little bit more on how to parallelize/vectorize the Fortran code below?

Thanks,

Wei


$ cat dep.F
program tst
implicit none
real, dimension(0:10000) :: a
real :: b
integer :: i

b = 1.2
a(0) = 1.0

!$acc kernels loop
do i = 1, 10000
a(i) = 1.001*(a(i-1) + b)
enddo
!$acc end kernels

print *, "a(10000) = ", a(10000)

end program tst


$ pgf90 -acc -Minfo=accel -Mprof=time -Mfree dep.F
tst:
10, Generating allocate(a(:))
Generating copyin(a(:9999))
Generating copyout(a(1:))
11, Loop carried dependence of a prevents parallelization
Loop carried backward dependence of a prevents vectorization
Accelerator scalar kernel generated

$ a.out
a(10000) = 2.6358178E+07

Hi Wei,

Can you show me a little bit more on how to parallelize/vectorize the Fortran code below?

This is a backwards dependency since the value assigned to “a(i)” requires the value of “a(i-1)” to be computed before it can be assigned. Hence, the ith iteration of the loop cannot be executed until the i-1 iteration is complete.

Such loops can not be parallelized since parallelization requires that loop iterations can be executed in any order.

You can still offload this loop to the target accelerator device, but it will be run sequentially. While not performant itself, you may still want to run a loop sequentially on the device if the cost of moving the data back and forth between the host and device is more that the extra time to run the loop sequentially.

Hope this helps,
Mat