Fortran -> C in OpenACC


We are testing a basic OpenACC implementation, mostly by converting existing OpenMP code. We are also testing reductions in loops in both Fortran and C.

The following Fortran code works well, although it may not follow best practice. Any advice is welcome.

    !$acc data copyin(a(1:n)) copy(r(1:n),e(1:n))
    !$acc kernels 
    !$acc loop reduction(+:npair, sum_acc)
    do i=1, n
       sum_loc = 0.0
       do j=1,n
          npair = npair + 1
          r(i) = r(i)+dexp(a(i) + a(j))
          e(i) = e(i)+dlog(a(i) + a(j))
          sum_loc = sum_loc + r(i)*0.1d0 + e(i)*0.2d0          
       end do
       sum_acc = sum_acc + sum_loc
    end do
    !$acc end kernels
    !$acc end data

The -Minfo=accel messages are as follows:

41, Generating copy(e(1:n))
Generating copyin(a(1:n))
Generating copy(r(1:n))
42, Generating implicit copy(sum_acc,npair)
44, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
44, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Generating reduction(+:sum_acc,npair)
46, !$acc loop seq
46, Complex loop carried dependence of r prevents parallelization
Loop carried dependence of r,e prevents parallelization
Loop carried backward dependence of r,e prevents vectorization
Complex loop carried dependence of e prevents parallelization
Inner sequential loop scheduled on accelerator

We then converted it to C, as shown below, but pgcc reports that it cannot parallelize the loops; they are all scheduled sequentially.

    #pragma acc data copyin(a[0:N]) copy(r[0:N],e[0:N])
    #pragma acc kernels
    #pragma acc loop reduction(+:npair, sum_acc)
      for (i=0;i<N;i++) {
        sum_loc = 0.0;
        for (j=0;j<N;j++) {
          npair += 1;
          r[i] += exp(a[i] + a[j]);
          e[i] += log(a[i] + a[j]);
          sum_loc += r[i]*0.1 + e[i]*0.2;
        }
        sum_acc += sum_loc;
      }

The following message is from pgcc using -Minfo=acc.

43, Generating copy(e[:N])
Generating copyin(a[:N])
Generating copy(r[:N])
44, Generating implicit copy(sum_acc,npair)
46, Complex loop carried dependence of a->,r->,e-> prevents parallelization
Accelerator kernel generated
Generating Tesla code
46, #pragma acc loop seq
48, #pragma acc loop seq
48, Complex loop carried dependence of a->,r-> prevents parallelization
Loop carried dependence due to exposed use of r[i1],e[i1] prevents parallelization
Complex loop carried dependence of e-> prevents parallelization

As far as I can tell, we converted almost exactly the same loop from Fortran to C, yet the compiler responds very differently. Am I missing something? Any comments are appreciated.



PS. The PGI version is 18.3-0 (64-bit target on x86-64 Linux), and we’re testing on a P100 GPGPU card.


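This is most likely due to pointer aliasing. C allows pointers of the same type to point at the same object, whereas Fortran's rules let the compiler assume that ordinary arrays (those without the pointer or target attribute) do not overlap. In order to parallelize the loop, the compiler must prove that the objects are disjoint, but because of the potential for aliasing, it can’t.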
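To make the aliasing concern concrete, here is a small sketch in plain C with no OpenACC involved. The names add_one, last_disjoint and last_overlapping are made up for illustration and are not from the code in the question. The same function is called once with two separate arrays and once with two overlapping views of one buffer. In the second call, each iteration reads a value the previous iteration wrote, so the loop cannot safely run in parallel.

```c
#include <stddef.h>

/* Hypothetical helper: writes src[i] + 1.0 into dst[i] for i in [0, n).
   Nothing in the signature tells the compiler that dst and src refer
   to different arrays. */
void add_one(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] + 1.0;
    }
}

/* Disjoint case: every iteration is independent of the others. */
double last_disjoint(void)
{
    double src[5] = {0.0, 0.0, 0.0, 0.0, 0.0};
    double dst[5];

    add_one(dst, src, 5);
    return dst[4];   /* 0.0 + 1.0 */
}

/* Overlapping case: dst is buf shifted by one element, so iteration i
   writes buf[i + 1], which iteration i + 1 then reads. */
double last_overlapping(void)
{
    double buf[5] = {0.0, 0.0, 0.0, 0.0, 0.0};

    add_one(buf + 1, buf, 4);
    return buf[4];   /* sequential execution gives 4.0 */
}
```

Run sequentially, last_disjoint() returns 1.0 and last_overlapping() returns 4.0. The second value depends on the iterations running in order, and the compiler has to allow for that case whenever it cannot rule out overlap.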

Try adding the C99 “restrict” qualifier to your pointer declarations. “restrict” asserts to the compiler that the pointers don’t overlap.

Alternatively, you can use the flag “-Msafeptr” to assert that none of your pointers alias, but this is a big hammer and may result in runtime errors if your pointers do indeed overlap.