The OpenACC code below fails to be parallelized. The compiler output is:
70, Loop carried dependence of 'dx' prevents parallelization
Loop carried backward dependence of 'dx' prevents vectorization
Inner sequential loop scheduled on accelerator
!$acc kernels loop private(m,dx)
do m=1,me
zelectron0(1:5,m)=zelectron(1:5,m)
dx(1)=1.0
dx(2)=d_zelectron(1,1)
dx(3:4)=dx(1:2)*2.0
enddo
But if I changed
dx(3:4)=dx(1:2)*2.0
to
dx(3)=dx(1)*2.0
dx(4)=dx(2)*2.0
then the code is parallelized by the PGI compiler. This puzzles me. Could anyone please explain it? Thanks in advance.
Hi Yang Zhao,
Do you have the full output from the compiler feedback? I’m thinking that the message is coming from the implicit array syntax loop and not the outer m loop. If this is the case, then the message is just informational and can be ignored.
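For reference, a Fortran array assignment like dx(3:4)=dx(1:2)*2.0 is shorthand for an implicit element loop. The small standalone program below (a sketch, not the compiler's actual lowering; the initial values are made up for illustration) shows the equivalent explicit loop:

```fortran
program slice_expand
  implicit none
  real :: dx(4)
  integer :: i

  dx(1) = 1.0
  dx(2) = 3.0

  ! dx(3:4) = dx(1:2)*2.0 behaves like this implicit inner loop:
  do i = 3, 4
     dx(i) = dx(i-2) * 2.0   ! dx is both read and written here
  end do

  print *, dx   ! dx is now 1.0 3.0 2.0 6.0
end program slice_expand
```

Because dx is both read and written inside that implicit loop, the dependence analyzer flags it conservatively. Writing the two scalar assignments instead leaves no inner loop to analyze, which is why that form compiles cleanly.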
Hi Mat,
Thanks for your reply. The full output from the compiler feedback is as follows:
66, Loop is parallelizable
Accelerator kernel generated
66, !$acc loop gang ! blockidx%x
67, !$acc loop vector(32) ! threadidx%x
Loop is parallelizable
70, Loop carried dependence of 'dx' prevents parallelization
Loop carried backward dependence of 'dx' prevents vectorization
Inner sequential loop scheduled on accelerator
So the code, with line numbers, is:
65 !$acc kernels loop private(m,dx)
66 do m=1,me
67 d_zelectron0(1:5,m)=d_zelectron(1:5,m)
68 dx(1)=1.0
69 dx(2)=d_zelectron(1,1)
70 dx(3:4)=dx(1:2)*2.0
71 ! dx(4)=dx(2)*2.0
...
enddo
!$acc end kernels
So the m loop is parallelizable, but line 70 cannot be vectorized. The compiler will treat it like this:
dx(3)=dx(1)*2.0
dx(4)=dx(2)*2.0
Am I right?
Another question: if dx is not included in the private clause, is it shared by all CUDA threads? If so, one thread could write to dx while another thread reads it, which would be a programming error. If we add dx to the private clause, the problem is solved. Is this right?
So what is happening is that the compiler is vectorizing the array expression. Since the expression is small, you'll want to override the compiler's default schedule and replace line 65 with
65 !$acc kernels loop gang vector private(dx)
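Applied to the region above, the loop would look roughly like this (a sketch; the elided body is unchanged from the code you posted):

```fortran
!$acc kernels loop gang vector private(dx)
do m=1,me
   d_zelectron0(1:5,m)=d_zelectron(1:5,m)
   dx(1)=1.0
   dx(2)=d_zelectron(1,1)
   dx(3:4)=dx(1:2)*2.0
   ...
enddo
!$acc end kernels
```

With both gang and vector parallelism scheduled on the m loop, each iteration runs on a single thread, so the implicit array-syntax loop executes sequentially within that thread and the compiler no longer tries to vectorize it.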
Is dx shared for all CUDA threads if dx is not included in private clause? If so, it will occur that one thread write to dx while another thread also read it. This will be a programming error. If we add dx in the private clause, the problem will be solved, is this right?
Correct. Arrays are shared by default so not including dx in a private clause will cause a race condition.
Note that scalars are private by default so there’s no need to include “m” in the private clause.
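To make both points concrete, here is a sketch contrasting the two versions (using the variables from the code above; not a complete program, and it assumes an OpenACC compiler such as nvfortran):

```fortran
! Racy: dx is shared by default, so one iteration's store to dx(2)
! can overlap with another iteration's read of dx(1:2).
!$acc kernels loop
do m=1,me
   dx(2)=d_zelectron(1,1)
   dx(3:4)=dx(1:2)*2.0
enddo

! Safe: each iteration gets its own private copy of dx. The loop
! index m is a scalar and private by default, so it need not be
! listed in the clause.
!$acc kernels loop private(dx)
do m=1,me
   dx(2)=d_zelectron(1,1)
   dx(3:4)=dx(1:2)*2.0
enddo
```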