How can I avoid this crash?

The following loop in my program crashes without any error message (no out-of-memory error or anything like that):

SUBROUTINE iterate_wo( td )! (p1,p2,q1,q2)->(p1',p2',q1',q2')
   USE params
   USE accel_lib
   USE precision
   USE data_arr

   IMPLICIT NONE
   INTEGER(KIND=8) :: i,j,td
   REAL(fp_kind),DIMENSION(4) :: xloc

   !$acc data region copy( x )
   !$acc region
   !$acc do parallel private( xloc )
   DO i=1,niniconds*niniconds
      xloc(:)=x(:,i)
      !$acc do seq
      DO j=1,td
         xloc(1)=xloc(1)+k1*(1+e*COS(xloc(4)))*SIN(xloc(3))
         xloc(2)=xloc(2)+k2*(1+e*COS(xloc(3)))*SIN(xloc(4))
         xloc(3)=xloc(3)+xloc(1)
         xloc(4)=xloc(4)+xloc(2)
      END DO
      x(:,i)=xloc(:) 
   END DO
   !$acc end region
   !$acc end data region

END SUBROUTINE iterate_wo

x is defined as

REAL(fp_kind), ALLOCATABLE,  DIMENSION(:,:) :: x

in the data_arr module and allocated like this:

n=niniconds*niniconds
ALLOCATE( x(4,n) )

where niniconds=256. td is usually quite large (around 10^7).

I assume that each thread of the parallel loop gets its own complete copy of x in device memory. Is that true, and if so, how could I change the loop to use less memory?

Relevant compiler output:

iterate_wo:
     85, Generating copy(x(:,:))
     88, Loop is parallelizable
     89, Loop is parallelizable
         Accelerator kernel generated
         88, !$acc do parallel ! blockidx%y
         89, !$acc do parallel, vector(4) ! blockidx%x threadidx%x
     91, Complex loop carried dependence of 'xloc' prevents parallelization
         Loop carried reuse of 'xloc' prevents parallelization
         Accelerator kernel generated
         88, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
         91, !$acc do seq
             Non-stride-1 accesses for array 'xloc'
     97, Loop is parallelizable
         Accelerator kernel generated
         88, !$acc do parallel ! blockidx%y
         97, !$acc do parallel, vector(256) ! blockidx%x threadidx%x

Best regards,
mos

EDIT: fp_kind is defined like this:

integer, parameter :: Double = kind(0.0d0) ! Double precision
integer, parameter :: fp_kind = Double

Hi Mos,

My best guess is that, as written, your directives create three kernels (the compiler output shows three "Accelerator kernel generated" messages). You really want one, so that the same private xloc is used throughout. As it is now, each kernel uses a different xloc. Try changing your directives to the following so that only the i loop is parallelized. The “kernel” directive tells the compiler to use the body of the loop as the device kernel.

   !$acc region
   !$acc do parallel, vector(256), kernel, private( xloc )
   DO i=1,niniconds*niniconds
      xloc(:)=x(:,i)
      DO j=1,td
         xloc(1)=xloc(1)+k1*(1+e*COS(xloc(4)))*SIN(xloc(3))
         xloc(2)=xloc(2)+k2*(1+e*COS(xloc(3)))*SIN(xloc(4))
         xloc(3)=xloc(3)+xloc(1)
         xloc(4)=xloc(4)+xloc(2)
      END DO
      x(:,i)=xloc(:)
   END DO
   !$acc end region

Also, try a few different vector sizes to see which one gives the best performance.
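
To check that a restructured loop still computes the same thing, it can help to have a plain CPU version of the update to compare against for a small td. The following is only a sketch under assumptions: the program name check_iterate, the values of k1, k2 and e, the initial conditions, and the reduced sizes n and td are all placeholders, not taken from the original code. It also uses four scalars (p1, p2, q1, q2) in place of the xloc array, following the naming in the comment on the SUBROUTINE line.

```fortran
! CPU-only reference for the update in iterate_wo.
! All parameter values and sizes below are placeholders for illustration.
PROGRAM check_iterate
   IMPLICIT NONE
   INTEGER, PARAMETER :: fp_kind = KIND(0.0d0)
   INTEGER, PARAMETER :: n = 16                 ! stand-in for niniconds*niniconds
   INTEGER(KIND=8), PARAMETER :: td = 1000_8    ! stand-in for the real td
   REAL(fp_kind), PARAMETER :: k1 = 0.1_fp_kind ! assumed value
   REAL(fp_kind), PARAMETER :: k2 = 0.1_fp_kind ! assumed value
   REAL(fp_kind), PARAMETER :: e  = 0.5_fp_kind ! assumed value
   REAL(fp_kind) :: x(4,n)
   REAL(fp_kind) :: p1, p2, q1, q2
   INTEGER(KIND=8) :: i, j

   ! Arbitrary initial conditions, for illustration only
   DO i = 1, n
      x(:,i) = (/ 0.0_fp_kind, 0.0_fp_kind, 0.01_fp_kind*i, 0.02_fp_kind*i /)
   END DO

   DO i = 1, n
      ! Scalars take the place of xloc(1:4)
      p1 = x(1,i); p2 = x(2,i); q1 = x(3,i); q2 = x(4,i)
      DO j = 1, td
         p1 = p1 + k1*(1+e*COS(q2))*SIN(q1)
         p2 = p2 + k2*(1+e*COS(q1))*SIN(q2)
         q1 = q1 + p1
         q2 = q2 + p2
      END DO
      x(:,i) = (/ p1, p2, q1, q2 /)
   END DO

   PRINT *, x(:,1)
END PROGRAM check_iterate
```

With the real parameters substituted in, the printed values can be compared against the same column of x after the accelerator run. Keep td small for this comparison, since CPU and device results may drift apart in the low digits as the iteration count grows.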

Hope this helps,
Mat