We are testing a basic OpenACC implementation, mostly by converting existing OpenMP codes. We are also testing Fortran and C with a reduction in a loop.
The following Fortran code works well, even though it might not be best practice. Any advice is welcome.
!$acc data copyin(a(1:n)) copy(r(1:n),e(1:n))
!$acc kernels
!$acc loop reduction(+:npair, sum_acc)
do i=1, n
   sum_loc = 0.0
   do j=1,n
      npair = npair + 1
      r(i) = r(i)+dexp(a(i) + a(j))
      e(i) = e(i)+dlog(a(i) + a(j))
      sum_loc = sum_loc + r(i)*0.1d0 + e(i)*0.2d0
   end do
   sum_acc = sum_acc + sum_loc
end do
!$acc end kernels
!$acc end data
The -Minfo=accel output is as follows:
41, Generating copy(e(1:n))
Generating copyin(a(1:n))
Generating copy(r(1:n))
42, Generating implicit copy(sum_acc,npair)
44, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
44, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Generating reduction(+:sum_acc,npair)
46, !$acc loop seq
46, Complex loop carried dependence of r prevents parallelization
Loop carried dependence of r,e prevents parallelization
Loop carried backward dependence of r,e prevents vectorization
Complex loop carried dependence of e prevents parallelization
Inner sequential loop scheduled on accelerator
Then we converted it to C, as shown below, but pgcc reports that it cannot parallelize the loops; all of them run sequentially.
#pragma acc data copyin(a[0:N]) copy(r[0:N],e[0:N])
#pragma acc kernels
#pragma acc loop reduction(+:npair, sum_acc)
for (i=0;i<N;i++) {
    sum_loc = 0.0;
    for (j=0;j<N;j++) {
        npair += 1;
        r[i] += exp(a[i] + a[j]);
        e[i] += log(a[i] + a[j]);
        sum_loc += r[i]*0.1 + e[i]*0.2;
    }
    sum_acc += sum_loc;
}
The following message is from pgcc using -Minfo=acc.
43, Generating copy(e[:N])
Generating copyin(a[:N])
Generating copy(r[:N])
44, Generating implicit copy(sum_acc,npair)
46, Complex loop carried dependence of a->,r->,e-> prevents parallelization
Accelerator kernel generated
Generating Tesla code
46, #pragma acc loop seq
48, #pragma acc loop seq
48, Complex loop carried dependence of a->,r-> prevents parallelization
Loop carried dependence due to exposed use of r[i1],e[i1] prevents parallelization
Complex loop carried dependence of e-> prevents parallelization
I assume that we converted almost the same loop from Fortran to C, yet the compiler responds very differently. Am I missing something? Any comments are appreciated.
PS. The PGI version is 18.3-0 (64-bit target on x86-64 Linux), and we're testing on a P100 GPGPU card.