Problem with fixed-form Fortran OpenACC, daxpy

I’m working on accelerating the nuclear reaction integrator in the open source low Mach astrophysics code Maestro.

In the code we include the fixed-form Fortran source of LINPACK routines like daxpy*. I’m trying to mark it up with OpenACC as a simple sequential routine. However, I’m getting opaque errors that I’m hoping the forums can help me with. You can see the marked up code here. The error I’m getting is:

pgf95   -module t/Linux.PGI.omp.acc/m -It/Linux.PGI.omp.acc/m  -mp=nonuma -Minfo=mp -acc -Minfo=acc -O2  -I/home/ajacobs/Codebase/MAESTRO/Microphysics/EOS/helmeos  -c -o t/Linux.PGI.omp.acc/o/daxpy.o /home/ajacobs/Codebase/MAESTRO/Util/BLAS/daxpy.f
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Matching ref index not found (/home/ajacobs/Codebase/MAESTRO/Util/BLAS/daxpy.f: 1)
daxpy:
      1, Generating acc routine seq
     37, Loop is parallelizable
     51, Loop is parallelizable
     56, Loop is parallelizable
  0 inform,   0 warnings,   1 severes, 0 fatal for daxpy

This was compiled using PGI 15.10 via the OpenACCToolkit. It’s on a Linux workstation with a Maxwell GPU (GeForce GTX 960). I use this for development, the ultimate target is OLCF’s Titan supercomputer. Thanks for any assistance! This is my first time posting in the forums, so please let me know if I left out any essential info.

*I know there are GPU-aware libraries for such routines, but it’s not clear to me that we can compile them for the device and call from device. Using these libraries also complicates the need for our code to be architecture-agnostic, maximally portable, and maintained as a single codebase. Maybe we should use them, but we’d like to avoid it if possible.

Looks like a real bug in our device code generator. I can get it to compile by commenting out these two lines:

! IF (INCX.LT.0) IX = (-N+1)*INCX + 1
! IF (INCY.LT.0) IY = (-N+1)*INCY + 1

which doesn’t make any sense other than there’s a bug.

If I find a work-around I’ll let you know.

There is a device-side cublas, as you note. The calling conventions change, (cublasDaxpy instead of daxpy) and they take a handle as the first argument. These functions may launch other kernels, so if you have “lots” of work to do in these functions, you might consider that. It you want the work done by every thread, then compiling the existing library code using acc seq is probably the way to go.

Thank you for your reply and insight.

Changing the offending code to the following, for some reason, appears to fix the issue:

      if (incx.lt.0) then
         ix = 1
         ix = ix+(-n+1)*incx
      endif

So this is a workaround for the moment.