Complex loop carried dependence prevents parallelization

Hi,
By following the blog about “Deep Copy in OpenACC” at
https://www.pgroup.com/blogs/posts/deep-copy.htm
I am able to compile and run my OpenACC code like this:

!$acc enter data copyin(var,m%detJ, m%dxidx)  
!$acc enter data copyin(m) attach(m%detJ, m%dxidx)  
!$acc enter data create(vartmp)  
!$acc parallel loop vector gang default(present) 
     do k=1,nk
       do j=1,nj
          do i=1,ni
             vartmp(i,j,k,1) = var(i,j,k)*m%detJ(i,j,k)*m%dxidx(i,j,k)
          enddo
       enddo
     enddo
     call acc_detach(m%detJ)
     call acc_detach(m%dxidx)
!$acc exit data delete(m%detJ,m%dxidx)  
!$acc exit data delete(m)   
!$acc exit data copyout(vartmp)

However, I get warning messages:

Complex loop carried dependence of m%detj$p,vartmp,m%dxidx$p prevents parallelization


   1545, Generating enter data copyin(m%detj(:,:,:),m%dxidx(:,:,:),var(:,:,:))
   1546, Generating enter data attach(m%detj)
         Generating enter data copyin(m)
         Generating enter data attach(m%dxidx)
   1547, Generating enter data create(vartmp(:,:,:,:))
   1548, Accelerator kernel generated
         Generating Tesla code
       1549, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
       1550, !$acc loop seq
       1551, !$acc loop seq
   1548, Generating implicit present(m,vartmp(1:ni,1:nj,1:nk,1),var(:ni,:nj,:nk))
   1550, Complex loop carried dependence of m%detj$p,vartmp,m%dxidx$p prevents parallelization
   1551, Complex loop carried dependence of m%detj$p,vartmp,m%dxidx$p prevents parallelization
   1560, Generating exit data delete(m%detj(:,:,:),m%dxidx(:,:,:))
   1561, Generating exit data delete(m)
   1562, Generating exit data copyout(vartmp(:,:,:,:))

The do-loop indeed runs in serial:

    1548: compute region reached 2505 times
        1548: kernel launched 2505 times
            grid: [1]  block: [128]
            elapsed time(us): total=8,593,687 max=3,831 min=3,274 avg=3,430
    1548: data region reached 5010 times

It takes about 3.4 ms per call (8.6 s in total over 2505 calls) to run the do-loop with ni=65, nj=195 and nk=1

How can we make the do-loop run in parallel with such a deep copy?

Thanks. /JG

It is a little hard to tell without the entire program, but I would first try dropping the explicit attach operations.

!$acc enter data copyin(m, m%detJ, m%dxidx)

Specifying these in this order should allow the compiler to do the attach for you.

Delete them like you do:
!$acc exit data delete(m%detJ, m%dxidx, m)

Can you include the type of m? Are detJ and dxidx static, allocatable, or pointers?

Hi Brent,
If I use

!$acc enter data copyin(m, m%detJ, m%dxidx)

I still get the compile-time warning messages:

   1550, Complex loop carried dependence of m%detj$p,vartmp,m%dxidx$p prevents parallelization
   1551, Complex loop carried dependence of m%detj$p,vartmp,m%dxidx$p prevents parallelization

The code uses very complex structures. The variables are:

type(compElement),intent(inout),target         :: el
type(metric_fields),pointer                  :: m
m => el%metric
real(kind_dp), dimension(:,:,:), allocatable :: dxidx, detJ

If I rewrite the OpenACC directives like this:

!$acc enter data copyin(var)
!$acc enter data copyin(el,el%metric,el%metric%detJ, el%metric%dxidx)
!$acc enter data create(vartmp) 
!$acc parallel loop vector gang default(present) 
    do k=1,nk
       do j=1,nj
          do i=1,ni
             vartmp(i,j,k,1) = var(i,j,k)*el%metric%detJ(i,j,k)*el%metric%dxidx(i,j,k) 
          enddo
       enddo
    enddo

The compile-time warning messages disappeared, but the code crashes with:

Application 3420607 exit signals: Illegal instruction

Thanks a lot. /JG

Hi jigo3635,

The warning messages are because of the use of pointers, which can be aliased. Hence, the compiler must assume that other pointers may point to the same memory, making the inner loops non-parallelizable. The code is still being offloaded, but only the outer loop is parallelized, and given “nk=1”, it’s quite slow.

To fix this, explicitly add the “loop” directive to all of the loops. Also, given the trip counts, I’d suggest trying something like:

!$acc parallel loop gang collapse(2) default(present)
     do k=1,nk
       do j=1,nj
!$acc loop vector
          do i=1,ni
             vartmp(i,j,k,1) = var(i,j,k)*m%detJ(i,j,k)*m%dxidx(i,j,k)
          enddo
       enddo
     enddo
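Since the loops carry no real dependence, another variant worth trying (a sketch under the same assumptions, with the loop bounds from the post) is to collapse all three loops into a single parallel dimension; the explicit directive likewise overrides the compiler's aliasing analysis on the inner loops:

```fortran
!$acc parallel loop gang vector collapse(3) default(present)
     do k=1,nk
       do j=1,nj
          do i=1,ni
             vartmp(i,j,k,1) = var(i,j,k)*m%detJ(i,j,k)*m%dxidx(i,j,k)
          enddo
       enddo
     enddo
```

Which split (gang/vector over j and i, or a single collapsed dimension) performs better depends on the actual trip counts, so it is worth timing both.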

The “Application 3420607 exit signals: Illegal instruction” error is odd since it’s coming from the host side. It typically happens when a binary is compiled for a newer architecture but run on an older CPU. Are you running on the same host on which you built the binaries? Are you compiling with the “-tp” flag set?

-Mat

Hi Mat,

“nk=1”, it’s quite slow.

Yes, that is a 2D case with “nk=1”. Performance is now much improved with “collapse”. And the other issue was indeed related to cross-compiling with the wrong modules. Thanks.

So, if I port the code with complex data structures to the GPU using OpenACC, what is your recommendation: try to use

-ta=tesla:managed

(but I got an error with this flag even for the CPU version; see
https://forums.developer.nvidia.com/t/call-to-cumemfreehost-returned-error-700-illegal-address-du/136008/1)

or explicitly manage the data as in this thread, or even write a small subroutine that wraps the do-loop and takes only primitive-type variables?

Thanks again. /JG

Hi JG,

Using CUDA Unified Memory would certainly make things easier. It does have the caveat that only dynamic data (allocatable) is managed, so “el” would still need to be manually managed using data regions, but all of el’s allocatable members would be handled by the runtime.

Performance-wise, unified memory is good so long as you’re not “ping-ponging”, i.e. frequently accessing data on the host and device, back and forth.
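To make the caveat above concrete, a minimal sketch (hypothetical, assuming compilation with “-ta=tesla:managed” and the declarations from the earlier post) of what the data management reduces to:

```fortran
! With CUDA Unified Memory, the allocatable members (detJ, dxidx, var,
! vartmp) are migrated by the runtime; only the static derived-type
! variable el itself still needs an explicit data region.
!$acc enter data copyin(el)
!$acc parallel loop gang vector collapse(3) default(present)
     do k=1,nk
       do j=1,nj
          do i=1,ni
             vartmp(i,j,k,1) = var(i,j,k)*el%metric%detJ(i,j,k)*el%metric%dxidx(i,j,k)
          enddo
       enddo
     enddo
!$acc exit data delete(el)
```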

If you get this working, then you can go back and look at using data directives to better optimize the data movement.

but got error with the flag even for cpu version

Can you be more specific as to the nature of the crash? Is it segfaulting?

To me, this seems to indicate that there’s some type of memory error in your CPU code, such as an out-of-bounds access. Since the data layout may change a bit when using unified memory, an array may now sit next to a page boundary, so the out-of-bounds access causes a segfault.

You might try running your original version under Valgrind (www.valgrind.org) to see if any memory errors pop out. You can also try compiling with “-Mbounds” to have the compiler perform array bounds checking.

If that’s not the issue, then I’d need more info and quite possibly a reproducing example.

Hope this helps,
Mat