Complex loop carried dependence of 'd'

I’m attempting to compile the following code:

!$ACC REGION
!$ACC& LOCAL(loc1, loc2, nd, nd2, k, knd, total, zero)
!$ACC& COPYIN(a(:), ia(neq + 1), b(neq), ja(:))
!$ACC& COPY(d(neq))
do nd = 1, neq
   loc1 = ia(nd) - 1
   loc2 = ia(nd+1) - 1
   knd = loc2 - loc1
   ! total = b(nd)
   do k = 1, knd
      nd2 = ja(loc1+k)
      d(nd2) = d(nd2) + a(loc1+k)*b(nd)
   end do
end do
!$ACC END REGION

I get the following errors from the accelerator:

67, Loop carried scalar dependence for ‘total’ at line 67
Scalar last value needed after loop for ‘total’ at line 69
73, No parallel kernels found, accelerator region ignored
77, Complex loop carried dependence of ‘d’ prevents parallelization
82, Complex loop carried dependence of ‘d’ prevents parallelization

I’m not really sure how to overcome this…

Hi Joshua,

The problem is that nd2 can have the same value in more than one thread. Several threads may then update “d(nd2)” at the same time, so the value finally stored depends on whichever thread wrote last, and the results are non-deterministic. What you really need the compiler to do here is to create a private copy of ‘d’ for each thread and then perform a summation at the end of the region. We’re adding this support, but it won’t be available until the next major release.
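
A minimal sketch of the private-copy-plus-summation idea, written as ordinary serial Fortran so it runs anywhere. The small matrix, the name dpriv, and the use of nthreads "virtual threads" are all assumptions made for illustration; this is not what the compiler generates, only the pattern it would need to follow.

```fortran
program private_copy_reduction
  implicit none
  integer, parameter :: neq = 4, nnz = 6, nthreads = 2   ! hypothetical sizes
  ! Hypothetical sparse matrix: row nd owns entries ia(nd) .. ia(nd+1)-1
  integer :: ia(neq+1) = (/ 1, 3, 4, 6, 7 /)
  integer :: ja(nnz)   = (/ 1, 2, 2, 1, 3, 4 /)
  double precision :: a(nnz) = (/ 1d0, 2d0, 3d0, 4d0, 5d0, 6d0 /)
  double precision :: b(neq) = (/ 1d0, 1d0, 1d0, 1d0 /)
  double precision :: d(neq), dpriv(neq, nthreads)
  integer :: t, nd, k, nd2

  d = 0d0
  dpriv = 0d0

  ! Each "thread" t handles rows t, t+nthreads, ... and writes only to its
  ! own column dpriv(:, t), so two threads never update the same element.
  do t = 1, nthreads
     do nd = t, neq, nthreads
        do k = ia(nd), ia(nd+1) - 1
           nd2 = ja(k)
           dpriv(nd2, t) = dpriv(nd2, t) + a(k)*b(nd)
        end do
     end do
  end do

  ! Summation at the end: fold the private copies back into d.
  do t = 1, nthreads
     d = d + dpriv(:, t)
  end do

  ! Rows 1 and 2 both touch d(2) and rows 1 and 3 both touch d(1),
  ! which is exactly the collision described above.
  if (any(abs(d - (/ 5d0, 5d0, 5d0, 6d0 /)) > 1d-12)) error stop "unexpected d"
  print *, d
end program private_copy_reduction
```

The cost is one extra copy of d per thread, which is why this is normally left to the compiler or runtime rather than written by hand.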

A second issue is that your loops are triangular, meaning that the inner loop bound is calculated within the body of the outer loop (“knd = loc2 - loc1”). GPUs can only work with rectangular loops. To work around this, you’ll need to set the inner loop bound to the maximum value of knd and then use an if statement to skip the code when k > knd. For example:

knd = loc2 - loc1
do k = 1, max_knd
   if (k .le. knd) then
        ..
   end if
end do
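
For a fuller picture, here is a hedged sketch of the same padding applied to the original loop nest. It assumes max_knd is computed once on the host as the longest row length; the small test matrix is made up for illustration. Note that this only fixes the loop shape. The dependence on d is still there, so the sketch runs serially.

```fortran
program padded_inner_loop
  implicit none
  integer, parameter :: neq = 4, nnz = 6                 ! hypothetical sizes
  integer :: ia(neq+1) = (/ 1, 3, 4, 6, 7 /)
  integer :: ja(nnz)   = (/ 1, 2, 2, 1, 3, 4 /)
  double precision :: a(nnz) = (/ 1d0, 2d0, 3d0, 4d0, 5d0, 6d0 /)
  double precision :: b(neq) = (/ 1d0, 1d0, 1d0, 1d0 /)
  double precision :: d(neq)
  integer :: nd, k, knd, loc1, loc2, nd2, max_knd

  ! Longest row, computed once before the loop nest (assumed approach).
  max_knd = maxval(ia(2:neq+1) - ia(1:neq))

  d = 0d0
  do nd = 1, neq
     loc1 = ia(nd) - 1
     loc2 = ia(nd+1) - 1
     knd = loc2 - loc1
     ! Rectangular inner loop: every row iterates max_knd times,
     ! and the if statement masks off the padding iterations.
     do k = 1, max_knd
        if (k .le. knd) then
           nd2 = ja(loc1+k)
           d(nd2) = d(nd2) + a(loc1+k)*b(nd)
        end if
     end do
  end do

  if (any(abs(d - (/ 5d0, 5d0, 5d0, 6d0 /)) > 1d-12)) error stop "unexpected d"
  print *, d
end program padded_inner_loop
```

If row lengths vary a lot, most iterations of the padded loop do nothing, so it is worth checking how max_knd compares with the average row length.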

The statement “!$ACC& COPY(d(neq))” tells the compiler to copy in and out only a single element of d. I’m assuming that you want the entire array, so you should use “d(:)”.

Scalars are always defined as being “LOCAL” so there is no need to use the local directive here. It doesn’t hurt, but is just redundant.

Finally, you may want to back up and first evaluate whether this section of code is worth sending to the GPU. I see a lot of memory movement and little computation, so I’d guess that the computational intensity of this loop is less than 1. I like to see an intensity of at least 4 before attempting to accelerate a region, and prefer 10. One thing that might be helpful is to walk through the benchAMD tutorial that I wrote (see: http://www.pgroup.com/lit/articles/insider/v1n2a4.htm). It describes the process of determining the computational intensity of your loops.
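
As a rough back-of-the-envelope check, the sketch below counts arithmetic operations against array elements moved for this loop. It assumes intensity is taken as operations per element transferred between host and device, which may differ in detail from the definition in the tutorial, and the sizes neq and nnz are made up.

```fortran
program intensity_estimate
  implicit none
  integer, parameter :: neq = 100000, nnz = 700000   ! hypothetical sizes
  double precision :: flops, moved, intensity

  ! One multiply and one add per stored matrix entry.
  flops = 2d0 * nnz

  ! Elements transferred: a and ja in (nnz each), ia in (neq+1),
  ! b in (neq), d in and out (2*neq).
  moved = 2d0*nnz + dble(neq + 1) + dble(neq) + 2d0*neq

  intensity = flops / moved
  print '(a, f6.3)', 'estimated intensity: ', intensity

  ! moved always exceeds 2*nnz, so this ratio stays below 1 for any sizes.
  if (intensity >= 1d0) error stop "intensity unexpectedly >= 1"
end program intensity_estimate
```

Under that counting the ratio cannot reach 1 whatever the matrix size, which is consistent with the guess above.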

Hope this helps,
Mat

Thanks Mat,
I think I can move the accelerated region up a level and get a lot more work done with less copying (this example is part of an iterative solver for a sparse matrix). I’ll look forward to the next major release. :)

Regards,
-Joshua

Hi Joshua,

Great. If you can push the parallelization out and make these loops serial, then it might map to the GPU better.

- Mat

“What you really need the compiler to do here is to create a private copy of ‘d’ for each thread and then perform a summation at the end of region. While we’re adding this support, it won’t be available until the next major release.”

I’m finding that I’m simply not able to get this code to work without the summation option. Any idea when the next major release might be available?

Currently, PGI 2010 (aka 10.0) is scheduled for release in November. Of course, as with all software development, this date could slip (but not usually) and some expected features may not be available in early builds.

While it will slow your code down, what I’ve done to work around the reduction issue is to create temporary arrays to hold the intermediate calculations and then perform the reduction on the host. While not ideal, it will help you keep making progress in other areas.
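
One possible shape for that workaround, sketched for the loop from the original question. The array name tmp and the test matrix are assumptions for illustration, and the accelerator directives are left out so the example compiles with any Fortran compiler; only the first loop nest would go inside the accelerator region.

```fortran
program host_side_reduction
  implicit none
  integer, parameter :: neq = 4, nnz = 6                 ! hypothetical sizes
  integer :: ia(neq+1) = (/ 1, 3, 4, 6, 7 /)
  integer :: ja(nnz)   = (/ 1, 2, 2, 1, 3, 4 /)
  double precision :: a(nnz) = (/ 1d0, 2d0, 3d0, 4d0, 5d0, 6d0 /)
  double precision :: b(neq) = (/ 1d0, 1d0, 1d0, 1d0 /)
  double precision :: d(neq), tmp(nnz)
  integer :: nd, k

  ! Stage 1 (candidate for the accelerator region): every product is
  ! written to its own slot tmp(k), so no two iterations share an element.
  do nd = 1, neq
     do k = ia(nd), ia(nd+1) - 1
        tmp(k) = a(k)*b(nd)
     end do
  end do

  ! Stage 2 (on the host, serial): accumulate the products into d.
  d = 0d0
  do k = 1, nnz
     d(ja(k)) = d(ja(k)) + tmp(k)
  end do

  if (any(abs(d - (/ 5d0, 5d0, 5d0, 6d0 /)) > 1d-12)) error stop "unexpected d"
  print *, d
end program host_side_reduction
```

The temporary array has to be copied back to the host on every call, which is where the slowdown comes from.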

- Mat