Is this just a snippet with some code cut out? If not, then I’m wondering if “i” is uninitialized.

Oh wow, you’re right! Can’t believe I forgot this. I meant to have myid instead of i. Changing this got rid of the unwanted dependencies. Thanks Mat.

Now, at run time I still get some errors, which change from run to run. Perhaps you could tell me what some of these mean:

First run:

Segmentation fault (core dumped)

Second run:

call to cuMemHostUnregister returned error 713: Other

Error: _mp_pcpu_reset: lost thread

Third run:

call to cuMemcpyDtoHAsync returned error 700: Launch failed

Error: _mp_pcpu_reset: lost thread

In case it matters, here are the new compiler feedback and the full subroutine, in that order. myid is defined before the acc directive, so I think that is okay. The number 8404992 is just chunk (my total particle number is twice that), so that seems okay too. Adding copy clauses so that the entire arrays are copied still gave the same errors.

```
pgfortran -fast -Msave -mp -acc -Minfo=accel -c slab.f
ppush:
    288, Generating present_or_copyout(y3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(y1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(x3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(x1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(u2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(u3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(u1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(ey(:,:))
         Generating present_or_copyin(ex(:,:))
         Generating present_or_copyin(rwx(:lr))
         Generating present_or_copyin(x2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(rwy(:lr))
         Generating present_or_copyin(y2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(mu(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(w2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(w1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(w3(myid*8404992+1:myid*8404992+8404992))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    289, Loop is parallelizable
         Accelerator kernel generated
        289, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    295, Loop is parallelizable
```

```
c
      subroutine ppush
      use OMP_LIB
      include 'slab.h'
      real exp1,eyp
      real wx0,wx1,wy0,wy1
      integer m,i,j,myid,chunk
      real rhog,vfac
      real kap,th
c
! hardcoded for 2 threads
      call omp_set_num_threads(2)
c
!$OMP PARALLEL PRIVATE(myid)
      myid = OMP_GET_THREAD_NUM()
      call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL
      chunk=mm/2
      dv=dx*dy
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
      do 100 m=myid*chunk+1,myid*chunk+chunk
c
         exp1=0.
         eyp=0.
         rhog=sqrt(mu(m))/mims
         do l=1,lr
            xt=x2(m)+rwx(l)*rhog
            yt=y2(m)+rwy(l)*rhog
            if(xt.gt.lx) xt=2.*lx-xt
            if(xt.lt.0.) xt=-xt
            if(xt.eq.lx) xt=0.9999*lx
            if(yt.ge.ly) yt=yt-ly
            if(yt.le.0.) yt=yt+ly
            if(yt.eq.ly) yt=0.9999*ly
            if (ngp.eq.1) then
               i=int(xt/dx+0.5)
               j=int(yt/dy+0.5)
               exp1=exp1 + ex(i,j)
               eyp=eyp + ey(i,j)
            else
               i=int(xt/dx)
               j=int(yt/dy)
c
               wx0=float(i+1)-xt/dx
               wx1=1.-wx0
               wy0=float(j+1)-yt/dy
               wy1=1.-wy0
c
               exp1=exp1 + wx0*wy0*ex(i,j) + wx1*wy0*ex(i+1,j)
     %         + wx0*wy1*ex(i,j+1) + wx1*wy1*ex(i+1,j+1)
c
               eyp=eyp + wx0*wy0*ey(i,j) + wx1*wy0*ey(i+1,j)
     %         + wx0*wy1*ey(i,j+1) + wx1*wy1*ey(i+1,j+1)
            endif
         enddo ! end loop over 4 pt avg
         exp1=exp1/float(lr)
         eyp=eyp/float(lr)
c
c
c LINEAR: epara=0. for no e para.
c
         th=theta
c1       th=(x2(m)-0.5*lx)/ls
         u3(m)=u1(m)+ epara*2.*dt*q*mims*(eyp*th)
c
         vfac=0.5*( u2(m)*u2(m) + mu(m) )*mims/tets
c        vfac=0.5*( u2(m)*u2(m) - 1.0 )*mims/tets
c
         kap=( kapn -(1.5-vfac)*kapt )
c        kap=( kapn + vfac*kapt )
c
c LINEAR: next 3 lines are commented out if linear...
c
         x3(m)=x1(m)+ 2.*dt*( ecrossb*eyp )
         y3(m)=y1(m)+ 2.*dt*( u2(m)*th + ecrossb*(-exp1) )
c
         u1(m)=u2(m) + .25*( u3(m) - u1(m) )
         x1(m)=x2(m) + .25*( x3(m) - x1(m) )
         y1(m)=y2(m) + .25*( y3(m) - y1(m) )
         if(x3(m).gt.lx) x3(m)=2.*lx-x3(m)
         if(x3(m).lt.0.) x3(m)=-x3(m)
         if(x3(m).eq.lx) x3(m)=0.9999*lx
         if(y3(m).ge.ly) y3(m)=y3(m)-ly
         if(y3(m).le.0.) y3(m)=y3(m)+ly
         if(y3(m).eq.ly) y3(m)=0.9999*ly
c
c now, calculate weight for linearized case
c
c---------weigthing delft-f (dependent on ldtype)----
         w3(m)=w1(m) + 2.*dt*( eyp*kap+q*tets*
     %        (th*eyp*u2(m)) )
     %        *(1-w2(m))
c
         w1(m)=w2(m) + .25*( w3(m) - w1(m) )
c
 100  continue
!$OMP END PARALLEL
 90   continue
c
      return
      end
```
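
One thing that may be worth checking in the subroutine above: myid is PRIVATE in the first parallel region, so the value assigned there does not carry over into the second region, where myid is not in the PRIVATE list and is therefore shared. The loop bounds myid*chunk+1 and myid*chunk+chunk may then be computed from an undefined (or identical) myid on both threads. Here is a minimal CPU-only sketch, not the original code, of querying the thread id inside the same region that uses it. The program name slice_demo, the array owner, the variables lo and hi, and the toy size mm = 16 are made up for illustration, and the acc_set_device_num call is left as a comment so the sketch should build with any OpenMP Fortran compiler.

```fortran
      program slice_demo
      use omp_lib
      implicit none
      ! Toy particle count (hypothetical; the real mm comes from slab.h)
      integer, parameter :: mm = 16
      ! owner(m) records which thread handled particle m (hypothetical)
      integer :: owner(mm)
      integer :: m, myid, chunk, lo, hi

      owner = -1
      chunk = mm/2
      ! hardcoded for 2 threads, as in the original subroutine
      call omp_set_num_threads(2)

!$omp parallel private(myid, m, lo, hi)
      ! Query the thread id inside the region that uses it, and keep
      ! it private so each thread sees its own value.
      myid = omp_get_thread_num()
      ! In the real code the device binding would go here, e.g.
      ! call acc_set_device_num(myid, acc_device_nvidia)
      lo = myid*chunk + 1
      hi = myid*chunk + chunk
      do m = lo, hi
         owner(m) = myid
      end do
!$omp end parallel

      print *, owner
      end program slice_demo
```

Built with OpenMP enabled (for example gfortran -fopenmp, or pgfortran -mp) and run with two threads actually available, the first half of owner should come out as 0 and the second half as 1. Whether the myid scoping is the real cause of the run-time errors above is only a guess; out-of-range i or j in the ex and ey lookups could produce similar symptoms.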