Inner sequential loop scheduled on accelerator

The OpenACC code below failed to be parallelized. The compiler output the following:
70, Loop carried dependence of 'dx' prevents parallelization
Loop carried backward dependence of 'dx' prevents vectorization
Inner sequential loop scheduled on accelerator

 !$acc kernels loop private(m,dx)
      do m=1,me
         d_zelectron0(1:5,m)=d_zelectron(1:5,m)
         dx(1)=1.0
         dx(2)=d_zelectron(1,1)
         dx(3:4)=dx(1:2)*2.0
      enddo

But if I changed

 dx(3:4)=dx(1:2)*2.0

to

 dx(3)=dx(1)*2.0
 dx(4)=dx(2)*2.0

The code can then be parallelized by the PGI compiler. I am puzzled by this. Could anyone please explain it? Thanks in advance.

Hi Yang Zhao,

Do you have the full output from the compiler feedback? I’m thinking that the message is coming from the implicit array syntax loop and not the outer m loop. If this is the case, then the message is just informational and can be ignored.
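
For reference, that array-syntax assignment is itself a small implicit loop. A rough sketch of what the compiler sees (the explicit index i is only illustrative):

 do i = 1, 2
    dx(i+2) = dx(i)*2.0
 enddo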

  • Mat

Hi Mat,

Thanks for your reply. The full output from the compiler feedback is as follows:

     66, Loop is parallelizable
         Accelerator kernel generated
         66, !$acc loop gang ! blockidx%x
         67, !$acc loop vector(32) ! threadidx%x
         Loop is parallelizable
         70, Loop carried dependence of 'dx' prevents parallelization
         Loop carried backward dependence of 'dx' prevents vectorization
         Inner sequential loop scheduled on accelerator

So the code with line numbers is:

 65 !$acc kernels loop private(m,dx)
 66      do m=1,me
 67         d_zelectron0(1:5,m)=d_zelectron(1:5,m)
 68         dx(1)=1.0
 69         dx(2)=d_zelectron(1,1)
 70         dx(3:4)=dx(1:2)*2.0
 71        ! dx(4)=dx(2)*2.0
               ...
           enddo
      !$acc end kernels

So the m loop is parallelizable, and line 70 cannot be vectorized. The compiler will do something like this:

 dx(3)=dx(1)*2.0
 dx(4)=dx(2)*2.0

Am I right?

Another question: is dx shared by all CUDA threads if dx is not included in the private clause? If so, one thread could write to dx while another thread reads it, which would be a programming error. If we add dx to the private clause, the problem will be solved, is that right?

What is happening is that the compiler is vectorizing the array expression. Given how small it is, you'll want to override the compiler's default schedule and replace it with:

65 !$acc kernels loop gang vector private(dx)
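
For example, the full construct with that schedule might look like the sketch below (the loop body is taken from your listing above, with everything past the array assignment elided):

 !$acc kernels loop gang vector private(dx)
       do m=1,me
          d_zelectron0(1:5,m)=d_zelectron(1:5,m)
          dx(1)=1.0
          dx(2)=d_zelectron(1,1)
          dx(3:4)=dx(1:2)*2.0
          ! ...
       enddo
 !$acc end kernels

With gang vector on the m loop, each m iteration runs on a single thread, so the small array assignments execute sequentially within that thread, which is usually the better schedule when the inner extents are this small.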

Is dx shared by all CUDA threads if dx is not included in the private clause? If so, one thread could write to dx while another thread reads it, which would be a programming error. If we add dx to the private clause, the problem will be solved, is that right?

Correct. Arrays are shared by default, so not including dx in a private clause will cause a race condition.
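
As an illustration (a sketch, not your original code), leaving dx out of the private clause means all threads read and write one shared copy of dx:

 ! Sketch: dx is NOT privatized here, so a single copy is shared by every thread
 !$acc kernels loop gang vector
       do m=1,me
          dx(1)=1.0              ! every thread writes the same dx(1)
          dx(2)=d_zelectron(1,1)
          dx(3:4)=dx(1:2)*2.0    ! and may read values another thread just wrote
       enddo
 !$acc end kernels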

Note that scalars are private by default, so there's no need to include "m" in the private clause.

  • Mat