Loop carried dependence

Hi all,

I have a very simple problem: I’m trying to accelerate the following loop using OpenACC.

      IB(1)  = 1
      IBB(1) = 1
      DO 5 I =2,IMM1
      IB(I) = 1 + (I-1)/NSBLOCK
      NLEFT = IMM1 - (IB(I)-1)*NSBLOCK
      IF(NLEFT.LT.NSMID)  IB(I) = IB(I-1)
      IBB(I) = 1 + (I-1)/NBBLOCK
      NLEFT  = IMM1 - (IBB(I)-1)*NBBLOCK
      IF(NLEFT.LT.NBMID)  IBB(I) = IBB(I-1)
    5 CONTINUE

When I try to compile using OpenACC, I get the following messages:

Loop carried dependence of ‘ib’ prevents parallelization
Loop carried backward dependence of ‘ib’ prevents vectorization
Loop carried dependence of ‘ibb’ prevents parallelization
Loop carried scalar dependence for ‘nleft’ at line 4622
Loop carried backward dependence of ‘ibb’ prevents vectorization

I’ve tried splitting the loop as per the directions in this article (http://www.pgroup.com/lit/articles/insider/v1n2a1.htm), but I’m still having trouble getting this loop offloaded to the accelerator.

Any pointers as to a work-around would be much appreciated. Apologies if I’ve missed something obvious, but this is all rather new to me!

Thanks,

Chris

Hi Chris,

Unfortunately, you have two backwards dependencies which will prevent parallelization.

...
IF(NLEFT.LT.NSMID)  IB(I) = IB(I-1)
...
IF(NLEFT.LT.NBMID)  IBB(I) = IBB(I-1) 
...

Iteration I may read “IB(i-1)”, so unless the value of “IB(i-1)” is computed first, it can’t be assigned to “IB(i)”. The same holds for IBB. So unless you can remove these statements, there’s no way to parallelize this loop.
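
As a rough sketch of what removing those statements could look like: NLEFT depends only on the block number and shrinks as the block number grows, so if NSBLOCK is positive, once the IF fires it fires for every later I, and IB(I) simply stays at the last block number that passed the test. Under that assumption (not checked against your real inputs), each element can be written without reading its neighbour. The sizes below are made up and IBMAX is a hypothetical name; the program compares the closed form against the original loop.

```fortran
program closed_form_check
  implicit none
  ! Illustrative sizes only; the real values come from the application.
  integer, parameter :: imm1 = 37, nsblock = 8, nsmid = 6
  integer :: ib_ref(imm1), ib(imm1)
  integer :: i, nleft, ibmax

  ! Reference: the original sequential loop (IB part only).
  ib_ref(1) = 1
  do i = 2, imm1
     ib_ref(i) = 1 + (i-1)/nsblock
     nleft = imm1 - (ib_ref(i)-1)*nsblock
     if (nleft < nsmid) ib_ref(i) = ib_ref(i-1)
  end do

  ! Largest block number that still passes the NLEFT test (hypothetical name).
  ! Assumes nsblock > 0. MAX(1, ...) covers the case where even block 1
  ! fails the test and every element keeps the value of IB(1).
  ibmax = max(1, 1 + (imm1 - nsmid)/nsblock)

  ! No iteration reads another iteration's result, so the loop is independent.
  !$acc parallel loop copyout(ib)
  do i = 1, imm1
     ib(i) = min(1 + (i-1)/nsblock, ibmax)
  end do

  ! Stop with an error if the closed form disagrees with the original loop.
  if (any(ib /= ib_ref)) error stop 'closed form differs from original loop'
  print *, 'closed form matches original loop'
end program closed_form_check
```

The IBB part would follow the same pattern with NBBLOCK and NBMID. It is worth running a comparison like this with your actual IMM1, block, and MID values before relying on it.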

Assuming this is part of a larger section of accelerated code, the question becomes whether it is faster to run the loop on the host and then copy the IB and IBB arrays to the device, or to run the loop sequentially on the device.
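
A minimal sketch of the two options, using made-up sizes and only the IB part of the loop. The array names IB_A and IB_B are hypothetical and exist only so the two results can be compared. The serial construct assumes a compiler that supports OpenACC 2.5 or later; with an older compiler, a parallel region with num_gangs(1) and a "loop seq" directive should behave similarly.

```fortran
program host_or_device_seq
  implicit none
  ! Illustrative sizes only; the real values come from the application.
  integer, parameter :: imm1 = 37, nsblock = 8, nsmid = 6
  integer :: ib_a(imm1), ib_b(imm1)
  integer :: i, nleft

  !$acc data create(ib_a, ib_b)

  ! Option A: compute on the host, then copy the finished array to the device.
  ib_a(1) = 1
  do i = 2, imm1
     ib_a(i) = 1 + (i-1)/nsblock
     nleft = imm1 - (ib_a(i)-1)*nsblock
     if (nleft < nsmid) ib_a(i) = ib_a(i-1)
  end do
  !$acc update device(ib_a)

  ! Option B: run the same loop on the device in a single thread,
  ! so the array never has to cross the host/device boundary.
  !$acc serial present(ib_b)
  ib_b(1) = 1
  !$acc loop seq
  do i = 2, imm1
     ib_b(i) = 1 + (i-1)/nsblock
     nleft = imm1 - (ib_b(i)-1)*nsblock
     if (nleft < nsmid) ib_b(i) = ib_b(i-1)
  end do
  !$acc end serial

  ! Copy the device result back only so the two options can be compared here.
  !$acc update host(ib_b)

  !$acc end data

  if (any(ib_a /= ib_b)) error stop 'host and device results differ'
  print *, 'both options give the same IB'
end program host_or_device_seq
```

Which option wins depends on how large IMM1 is and on what else is already resident on the device, so timing both in the real code is the only reliable guide.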

  • Mat