Hi,

We are testing a basic OpenACC implementation, mostly by converting existing OpenMP codes. We are also testing Fortran and C loops that contain reductions.

The following Fortran code works well, even though it might not be best practice. Any advice is welcome.

```
!$acc data copyin(a(1:n)) copy(r(1:n),e(1:n))
!$acc kernels
!$acc loop reduction(+:npair, sum_acc)
do i=1, n
   sum_loc = 0.0
   do j=1,n
      npair = npair + 1
      r(i) = r(i)+dexp(a(i) + a(j))
      e(i) = e(i)+dlog(a(i) + a(j))
      sum_loc = sum_loc + r(i)*0.1d0 + e(i)*0.2d0
   end do
   sum_acc = sum_acc + sum_loc
enddo
!$acc end kernels
!$acc end data
```

The -Minfo=accel messages are as follows:

41, Generating copy(e(1:n))

Generating copyin(a(1:n))

Generating copy(r(1:n))

42, Generating implicit copy(sum_acc,npair)

44, Loop is parallelizable

Accelerator kernel generated

Generating Tesla code

44, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

Generating reduction(+:sum_acc,npair)

46, !$acc loop seq

46, Complex loop carried dependence of r prevents parallelization

Loop carried dependence of r,e prevents parallelization

Loop carried backward dependence of r,e prevents vectorization

Complex loop carried dependence of e prevents parallelization

Inner sequential loop scheduled on accelerator

We then converted it to C, as shown below, but pgcc reports that it cannot parallelize the code; both loops are scheduled sequentially.

```
#pragma acc data copyin(a[0:N]) copy(r[0:N],e[0:N])
#pragma acc kernels
#pragma acc loop reduction(+:npair, sum_acc)
for (i=0;i<N;i++) {
    sum_loc = 0.0;
    for (j=0;j<N;j++) {
        npair += 1;
        r[i] += exp(a[i] + a[j]);
        e[i] += log(a[i] + a[j]);
        sum_loc += r[i]*0.1 + e[i]*0.2;
    }
    sum_acc += sum_loc;
}
```

The following messages are from pgcc using -Minfo=acc:

43, Generating copy(e[:N])

Generating copyin(a[:N])

Generating copy(r[:N])

44, Generating implicit copy(sum_acc,npair)

46, Complex loop carried dependence of a->,r->,e-> prevents parallelization

Accelerator kernel generated

Generating Tesla code

46, #pragma acc loop seq

48, #pragma acc loop seq

48, Complex loop carried dependence of a->,r-> prevents parallelization

Loop carried dependence due to exposed use of r[i1],e[i1] prevents parallelization

Complex loop carried dependence of e-> prevents parallelization

I assumed that we had converted almost the same loop from Fortran to C, yet the compiler responds very differently. Am I missing something? Any comments are appreciated.

Thanks,

BJ

PS. The PGI version is 18.3-0 (64-bit target on x86-64 Linux), and we are testing on a P100 GPGPU card.