Confusing Fortran accelerator problem

OS: CentOS release 5.3 (Final) x64
NVidia drivers: cudadriver_2.3_linux_64_190.18
PGI: 10.0

I am trying to help debug some code and am having some trouble. I have two pieces of code that are nearly identical but produce very different results.

The first always returns error 700 and occasionally (I don’t know why) causes a kernel panic in the NVIDIA drivers and crashes the system (!).
The second always works, but since its calculation is different, the result is obviously not very useful.

I have been trying to figure this out today and would greatly appreciate any help understanding the problem.

[EDIT] Forgot to mention that line 133 is the k loop.

The first version, which fails:

!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :             (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region



pgfortran  -fast -ta=nvidia -Minfo -c loop.f
looptest:
     76, Loop not vectorized/parallelized: loop count too small
     78, Unrolled inner loop 8 times
    130, Loop not vectorized/parallelized: contains call
    132, Generating copyin(ubar(1:nx,3:ny-2,2:nz-2))
         Generating copyin(u(1:nx,3:ny-2,2:nz-2,1:3)) 
         Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
    133, Loop is parallelizable
    134, Loop is parallelizable
    135, Loop is parallelizable
         Accelerator kernel generated
        133, !$acc do parallel
             Cached references to size [260x3] block of 'u'
             Cached references to size [260] block of 'ubar'
        134, !$acc do parallel
        135, !$acc do vector(256)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
	-lpapi



call to cuMemcpy2D returned error 700: Launch failed
Kernel Crash

The second version, which works:

!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i,j,k,1)-ubar(i,j,k))-
     :             (u(i,j,k,1)-ubar(i,j,k))))))
          END DO
        END DO
      END DO
!$acc end region



pgfortran  -fast -ta=nvidia -Minfo -c loop.f
looptest:
     76, Loop not vectorized/parallelized: loop count too small
     78, Unrolled inner loop 8 times
    130, Loop not vectorized/parallelized: contains call
    132, Generating copyin(u(3:nx-2,3:ny-2,2:nz-2,1:3))
         Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
         Generating copyin(ubar(3:nx-2,3:ny-2,2:nz-2))
    133, Loop is parallelizable
    134, Loop is parallelizable
    135, Loop is parallelizable
         Accelerator kernel generated
        133, !$acc do parallel, vector(4)
        134, !$acc do parallel, vector(4)
        135, !$acc do vector(16)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
	-lpapi

The output is wrong, but it does finish.

Hi CStackpole,

call to cuMemcpy2D returned error 700: Launch failed

This is normally caused by a seg fault when copying the data to the GPU. In this case, it’s most likely the compiler’s fault, due to the different bounds being used to copy “u” in and out. Try using the “copy” clause to tell the compiler to copy the entire “u” array in and out. This will also give you better performance, since the array can be copied in a single DMA transfer instead of the multiple copies required for array segments.
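To illustrate where the differing bounds come from, here is a rough free-form Fortran sketch (not the original code; the program name and the value of nx are made up). The “i” loop writes u only for i = 3..nx-2, but the i-2 and i+2 references read two elements further on each side, so the data read spans 1..nx. That matches the copyin(u(1:nx,...)) and copyout(u(3:nx-2,...)) lines in the -Minfo output.

```fortran
! Sketch only: nx is a hypothetical size, not taken from the original program.
program stencil_extents
  implicit none
  integer, parameter :: nx = 64
  integer, parameter :: offsets(2) = (/ -2, 2 /)   ! the i-2 and i+2 references
  integer :: write_lo, write_hi, read_lo, read_hi

  ! Range of i written by the loop "DO i=3,nx-2".
  write_lo = 3
  write_hi = nx - 2

  ! Range of i read once the stencil offsets are applied.
  read_lo = write_lo + minval(offsets)
  read_hi = write_hi + maxval(offsets)

  print '(a,i0,a,i0)', 'written: i = ', write_lo, ' .. ', write_hi
  print '(a,i0,a,i0)', 'read:    i = ', read_lo,  ' .. ', read_hi
end program stencil_extents
```

With the assumed nx = 64, this reports a written range of 3 .. 62 and a read range of 1 .. 64, which is the same widening the compiler applies to the first dimension of “u” and “ubar”.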

I’m a bit surprised that the compiler didn’t detect the forward and backward loop dependencies in the “i” loop (i.e., the “u(i+2” and “u(i-2” references). Let’s also add a “seq” directive to run the “i” loop sequentially. Otherwise, you might get non-deterministic results.

Here’s what I suggest to try:

!$acc region copy(u, ubar)
      DO k=2,nz-2
        DO j=3,ny-2
!$acc do seq
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :             (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region

Also, would you mind sending your code to PGI Customer Support (trs@pgroup.com) and asking them to forward it to me? I’d like to confirm that I’m correct and send a report to our engineers.

Thanks,
Mat

Thanks for your reply.

I did as you asked and inserted that code. Now when it runs I get:

call to cuMemcpyDtoH returned error 700: Launch failed

However, after several runs I have yet to get a kernel crash. That is already a great step forward. :)

I will compose an email with the code right after I post.

Thank you so much for your time and help!

Hi Chris,

My bad. I had a typo in the code. The directive should start with “!$acc”, not “!$add”. I’ve fixed it above.

- Mat

Hi Chris,

Looking at the code further, the compiler is correct that there isn’t a dependency in the “i” loop. What I failed to notice was that the left-hand “u” uses “2” for the last dimension, while the offset references on the right-hand side use “1”. Hence, no dependency. Unfortunately, the “seq” is still needed to work around the “cuMemcpyDtoH” error.
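For anyone who wants to convince themselves of this, here is a small host-only sketch in free-form Fortran (no accelerator directives; the array sizes, the coefficient values, and the program name are all made up for illustration). It applies the same update once with the “i” loop running forward and once running backward. Because the loop only writes u(:,:,:,2) and only reads u(:,:,:,1), u(:,:,:,3), and ubar, the two sweeps should give the same answer.

```fortran
! Sketch only: sizes and coefficients are hypothetical, not from the original code.
program no_dependency
  implicit none
  integer, parameter :: nx = 12, ny = 8, nz = 6
  real, parameter :: tema = 0.5, temh = 0.25
  real :: u_fwd(nx,ny,nz,3), u_bwd(nx,ny,nz,3), ubar(nx,ny,nz)
  integer :: i, j, k

  ! Same random input for both sweeps.
  call random_number(u_fwd)
  call random_number(ubar)
  u_bwd = u_fwd

  ! Forward sweep over i, as in the original loop nest.
  do k = 2, nz-2
    do j = 3, ny-2
      do i = 3, nx-2
        u_fwd(i,j,k,2) = -u_fwd(i,j,k,3)*temh                 &
             - tema*((u_fwd(i+2,j,k,1) - ubar(i+2,j,k))       &
                   - (u_fwd(i-2,j,k,1) - ubar(i-2,j,k)))
      end do
    end do
  end do

  ! Same update with the i loop reversed.
  do k = 2, nz-2
    do j = 3, ny-2
      do i = nx-2, 3, -1
        u_bwd(i,j,k,2) = -u_bwd(i,j,k,3)*temh                 &
             - tema*((u_bwd(i+2,j,k,1) - ubar(i+2,j,k))       &
                   - (u_bwd(i-2,j,k,1) - ubar(i-2,j,k)))
      end do
    end do
  end do

  ! With no loop-carried dependency, the difference should be zero.
  print *, 'max |forward - backward| =', maxval(abs(u_fwd - u_bwd))
end program no_dependency
```

If the left-hand side wrote to the same slab that the i-2 and i+2 references read (for example, “1” in the last dimension on both sides), the two sweeps would generally differ, which is the dependency I originally thought was there.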

The good news is that this error appears to have already been fixed internally, so the fix should hopefully be available in the next release (10.2 in February). I’ve assigned this to TPR#16479 for tracking purposes.

- Mat

Thanks for your time and help in getting this working. I appreciate it!

Hi Chris,

FYI, I just verified that TPR#16479 has been fixed in 10.2.

- Mat
% pgf90 -Minfo=accel loop.F -fast -ta=nvidia -o loop.out -V10.2; loop.out
looptest:
     87, Generating copyin(ubar(1:nx,3:ny-2,2:nz-2))
         Generating copyin(u(1:nx,3:ny-2,2:nz-2,1:3))
         Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
     89, Loop is parallelizable
     90, Loop is parallelizable
     94, Loop is parallelizable
         Accelerator kernel generated
         89, !$acc do parallel
             Cached references to size [260x3] block of 'u'
             Cached references to size [260] block of 'ubar'
         90, !$acc do parallel
         94, !$acc do vector(256)
 ok to here    4.233535989311349         522456.9
 ok to here   9.8799999750553980E-003   2.2387045E+08
 ok to here   9.8439999751462892E-003   2.2468915E+08
 ok to here   9.8299999751816358E-003   2.2500915E+08
 ok to here   9.8339999751715368E-003   2.2491763E+08
 ok to here   9.8289999751841606E-003   2.2503205E+08
 ok to here   9.8029999752498043E-003   2.2562890E+08
 ok to here   9.8339999751715368E-003   2.2491763E+08
 ok to here   9.8149999752195072E-003   2.2535302E+08
 ok to here   9.8189999752094081E-003   2.2526123E+08
 ok to here   9.8449999751437645E-003   2.2466634E+08
 ok to here   9.8239999751967844E-003   2.2514658E+08
 ok to here   9.8129999752245567E-003   2.2539896E+08
 ok to here   9.8049999752447548E-003   2.2558286E+08
 ok to here   9.8129999752245567E-003   2.2539896E+08
FORTRAN STOP