OS: CentOS release 5.3 (Final) x64

NVidia drivers: cudadriver_2.3_linux_64_190.18

PGI: 10.0

I am trying to help debug some code and am running into trouble. I have two pieces of code that are nearly identical but behave very differently.

The first always fails with error 700 (launch failed) and occasionally, for reasons I don't understand, causes a kernel panic in the NVIDIA driver and crashes the system (!).

The second always runs to completion, but because it computes something different from what the code needs, its results are not useful.

I have been trying to figure this out today and would greatly appreciate any help understanding the problem.

[EDIT] Forgot to mention that line 133 in the -Minfo output is the k loop (so 134 and 135 are the j and i loops).

The first version, which fails:

```
!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :        -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :        (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region
```

```
pgfortran -fast -ta=nvidia -Minfo -c loop.f
looptest:
76, Loop not vectorized/parallelized: loop count too small
78, Unrolled inner loop 8 times
130, Loop not vectorized/parallelized: contains call
132, Generating copyin(ubar(1:nx,3:ny-2,2:nz-2))
Generating copyin(u(1:nx,3:ny-2,2:nz-2,1:3))
Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
133, Loop is parallelizable
134, Loop is parallelizable
135, Loop is parallelizable
Accelerator kernel generated
133, !$acc do parallel
Cached references to size [260x3] block of 'u'
Cached references to size [260] block of 'ubar'
134, !$acc do parallel
135, !$acc do vector(256)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
-lpapi
```

```
call to cuMemcpy2D returned error 700: Launch failed
Kernel Crash
```

The second version, which runs:

```
!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :        -tema*(((((u(i,j,k,1)-ubar(i,j,k))-
     :        (u(i,j,k,1)-ubar(i,j,k))))))
          END DO
        END DO
      END DO
!$acc end region
```

```
pgfortran -fast -ta=nvidia -Minfo -c loop.f
looptest:
76, Loop not vectorized/parallelized: loop count too small
78, Unrolled inner loop 8 times
130, Loop not vectorized/parallelized: contains call
132, Generating copyin(u(3:nx-2,3:ny-2,2:nz-2,1:3))
Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
Generating copyin(ubar(3:nx-2,3:ny-2,2:nz-2))
133, Loop is parallelizable
134, Loop is parallelizable
135, Loop is parallelizable
Accelerator kernel generated
133, !$acc do parallel, vector(4)
134, !$acc do parallel, vector(4)
135, !$acc do vector(16)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
-lpapi
```

The output is wrong, but the run does finish.
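For reference, the only difference between the two versions is the i+2 / i-2 offsets. In the second version the two bracketed terms are identical and cancel, so it should reduce to `u(i,j,k,2) = -u(i,j,k,3)*temh`. Below is a minimal host-only sketch to check that. The program name, array sizes, coefficients, fill values, and the `ref` array are all made up for illustration and do not come from the real code.

```fortran
      program cancel_check
      implicit none
c     Hypothetical sizes and coefficients, for illustration only.
      integer nx, ny, nz
      parameter (nx=16, ny=12, nz=8)
      real u(nx,ny,nz,3), ubar(nx,ny,nz), ref(nx,ny,nz)
      real tema, temh, maxdiff
      integer i, j, k

      tema = 0.25
      temh = 0.5

c     Fill the inputs with arbitrary finite values.
      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            u(i,j,k,1) = 0.1*i + 0.2*j + 0.3*k
            u(i,j,k,2) = 0.0
            u(i,j,k,3) = 1.0 + 0.01*i*j
            ubar(i,j,k) = 0.05*k
            ref(i,j,k) = 0.0
          end do
        end do
      end do

c     Second version of the loop, run on the host (no directives).
      do k = 2, nz-2
        do j = 3, ny-2
          do i = 3, nx-2
            u(i,j,k,2) = -u(i,j,k,3)*temh
     :        -tema*(((((u(i,j,k,1)-ubar(i,j,k))-
     :        (u(i,j,k,1)-ubar(i,j,k))))))
c           What is left once the identical terms cancel.
            ref(i,j,k) = -u(i,j,k,3)*temh
          end do
        end do
      end do

c     Compare the two results over the loop range.
      maxdiff = 0.0
      do k = 2, nz-2
        do j = 3, ny-2
          do i = 3, nx-2
            maxdiff = max(maxdiff, abs(u(i,j,k,2)-ref(i,j,k)))
          end do
        end do
      end do

      print *, 'max difference:', maxdiff
      end
```

If the printed difference is zero (or only rounding noise), the second version never depends on the `tema` term or on any neighbouring element in i, which is what separates it from the failing version.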