Call in OpenACC region to procedure 'pgf90_copy_f90_argl'

I’m stuck trying to convert my code to OpenACC (it was previously OpenMP). When I compile with this:

pgfortran -Mpreprocess -Mnosecond_underscore -O0 -g -c -Minfo -Mneginfo -acc -ta=tesla:cc50,managed,lineinfo -o "$@" "$<"

where the make arguments are

appletonMod.o: ../appletonMod.f90 coordMod.o genParamsMod.o interpMod.o ionParamsMod.o typeSizes.o

I get the following errors:

PGF90-S-1000-Call in OpenACC region to procedure ‘pgf90_copy_f90_argl’ which has no acc routine information (…/appletonMod.f90: 944)
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Missing branch target block (…/appletonMod.f90: 1)

What is “pgf90_copy_f90_argl”? My uneducated guess is that it has to do with the assumed-shape arrays being passed in/out of the subroutine.
This is the current offending line (the module compiles when I comment it out, but obviously the code won’t run correctly):

call gridCoeff(iono%x,iono%y,iono%z,iono%Ne,&
    iono%xd,iono%yd,iono%zd,iono%xdp,iono%ixRow,iono%iyRow,iono%izRow, &
    p,fxyz,pf)

The subroutine gridCoeff is part of a 3D spline interpolation.
Here’s how it’s set up:

pure subroutine gridCoeff (xi,yi,zi,fi,&
    xd,yd,zd,xdp, ixrow,iyrow,izrow,pi,fxyz,pf)
!$acc routine seq

real(kind=dp), dimension(:), intent(in) :: xi, yi, zi
real(kind=dp), dimension(:,:,:), intent(in) :: fi, xdp
real(kind=dp), dimension(:,:), intent(in) :: xd, yd, zd
integer, dimension(:), intent(in) :: ixrow, iyrow, izrow
real(kind=dp), dimension(3), intent(in) :: pi
real(kind=dp), intent(out) :: fxyz
real(kind=dp), dimension(3), intent(out) :: pf
real(kind=dp), dimension(size(zi,dim=1)) :: zdp, fxy !, &
real(kind=dp), dimension(size(yi,dim=1),size(zi,dim=1)) :: ydp, fx
integer :: nrx, nry, nrz, ii, jj, kk, ll, i1, i2, j1, j2, & 
    k1, k2, xflag, yflag, zflag
real(kind=dp) :: eps, px, py, pz, hx, tx, dx1, dx2, hz, dz1, dz2, tz, & 
    hy, dy1, dy2, ty, c1, c2, dfdx, dfdy, dfdz, & 
    dc1dz, dc2dz, dc1dy, dc2dy, dc1dx, dc2dx
real(kind=dp) :: yb(size(yi,dim=1)-2), ys(size(yi,dim=1)-2), zb(size(zi,dim=1)-2), zs(size(zi,dim=1)-2)

The “iono” defined type contains both assumed-shape and allocatable arrays (x, y, z, and Ne are passed in with their dimensions; xd, yd, zd, and others are allocated based on those sizes). The subroutine gridCoeff calls one other subroutine, which is pretty harmless: just a linear equation solver with assumed-shape array arguments. I’ve peppered acc routine directives almost everywhere.
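Roughly, the type looks something like the following. This is only a sketch: the component names are taken from the call above, but the type name ionoType is a placeholder, and which components are pointer vs. allocatable (and their exact kinds and ranks) are approximations.

type ionoType
    ! grid vectors and data, passed in with their dimensions
    real(kind=dp), pointer :: x(:), y(:), z(:)
    real(kind=dp), pointer :: Ne(:,:,:)
    ! spline work arrays, allocated based on the sizes of the above
    real(kind=dp), allocatable :: xd(:,:), yd(:,:), zd(:,:)
    real(kind=dp), allocatable :: xdp(:,:,:)
    integer, allocatable :: ixRow(:), iyRow(:), izRow(:)
end type ionoType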
I came across a post about C++ code that was getting a similar error (“Call in OpenACC region to procedure”) but could not infer enough from that to help me here, likely because I’m still a bit of a noob. https://forums.developer.nvidia.com/t/pgcc-s-1000-call-in-openacc-region-to-procedure-cxa-vec-c/135226/1
Any thoughts? No doubt I have not provided enough info here, but I’m not even sure what else to post. Any guidance is appreciated.
Thank you!

UPDATE (progress?):
i modified the above call to gridCoeff:

call gridCoeff(iono%x(1:nx),iono%y(1:ny),iono%z(1:nz),iono%Ne(1:nx,1:ny,1:nz), &
    iono%xd(1:nx-2,1:nx-2), iono%yd(1:ny-2,1:ny-2), iono%zd(1:nz-2,1:nz-2), &
    iono%xdp(1:nx,1:ny,1:nz+1), &
    iono%ixRow(1:nx-2), iono%iyRow(1:ny-2), iono%izRow(1:nz-2), &
    p(1:3),fxyz,pf(1:3))

And the compile error changed to:

PGF90-S-1000-Call in OpenACC region to procedure ‘pgf90_sect1’ which has no acc routine information (…/appletonMod.f90: 938)
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Missing branch target block (…/appletonMod.f90: 1)

Which is perhaps more mystifying to me.

Hi sdl,

Can you try compiling this code with optimization? (i.e. replace -O0 -g with -fast).

When passing in allocatable arrays to subroutines, we need to call a runtime routine (pgf90_copy_f90_argl) to determine if the array being passed is a non-contiguous sub-array. If so, then the runtime will create a temp array, pack the sub-array into a contiguous array, and then pass the temp array to the routine. If it is contiguous, then we pass in the array. Typically though at higher optimization levels, we can often get rid of the call if the compiler can determine the array is contiguous, which appears to be the case here.
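For illustration, here is a minimal sketch of the general contiguity issue (hypothetical names; whether a given call actually needs the packing step depends on the interface the compiler sees at the call site):

program contig_demo
    implicit none
    real(kind=8), allocatable :: a(:,:)
    allocate(a(100,100)); a = 0.0d0
    ! With an explicit-shape (F77-style) dummy, a contiguous section can be
    ! passed by address, but a strided section must be packed into a
    ! temporary first (copy-in/copy-out).
    call sub77(a(:,1), 100)   ! column: contiguous in Fortran, no temp needed
    call sub77(a(1,:), 100)   ! row: strided, so a temp gets created and packed
contains
    subroutine sub77(v, n)
        integer, intent(in) :: n
        real(kind=8), intent(inout) :: v(n)
        v = 1.0d0
    end subroutine sub77
end program contig_demo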

-Mat

Hi Mat,

Thanks for the response. Unfortunately things have gotten worse.
I tried to compile with optimization, and that broke something further upstream. I’ll get to that in a moment.

I had guessed that pgf90_copy_f90_argl had something to do with the compiler trying to find the dimensions or extents of the assumed-shape arrays, so I thought I could help it by explicitly specifying those in the call (e.g., instead of the argument xi, call it with the argument xi(1:nx)). Doing so did indeed make the error with pgf90_copy_f90_argl go away. Or at least it changed it to the same error, but with pgf90_sect1. So what does pgf90_sect1 do? As with your suggestion to compile with optimization, I was looking for additional hints about what I might have done wrong in this code. Another thing I tried was to comment out the entire body of the routine: everything between the declarations of the arguments and the end statement, including all other variable declarations and all operations. The pgf90_sect1 error still occurred, so it definitely seems like something is wrong with how I’ve written the arguments.

Now, back to the new error from compiling with optimization. It’s in a module that compiles fine with debugging on and is required for the previously discussed module to build.

Unhandled builtin: 678 (pgf90_mzero8)
PGF90-F-0000-Internal compiler error. Unhandled builtin function. 0 (…/interpMod.f90: 182)

I tried replacing “-O0 -g” with “-fast”, as well as “-O1” and “-O2”, all with the same results. As a sanity check I made doubly sure that I could still build the same code with OpenMP and gfortran with optimization on (-O2), and it was fine.
The line where that unhandled builtin error occurs is at the end of what I thought was a fairly simple subroutine, a linear equation solver:

pure subroutine dLinEqShort(a,b,x,irow)
real(kind=dp), dimension(:,:), intent(in) :: a
real(kind=dp), dimension(:), intent(in) :: b
real(kind=dp), dimension(:), intent(out) :: x
real(kind=dp), dimension(size(b)) :: q
integer, dimension(:), intent(in) :: irow
integer :: i,j,k,l,m,n
!$acc routine

n = size(a, dim=1)

if (n .eq. 1) then
   x(1)=b(1)/a(1,1)
   return
endif

x = 0.0d0
j=irow(1)
x(1)=b(j)/a(1,1)
do i=2,n
   j=irow(i)
   k=i-1
   do l=1,k
      x(i)=x(i)+a(i,l)*x(l)
   enddo
   x(i)=(b(j)-x(i))/a(i,i)
enddo
k=n-1
do i=1,k
   j=n-i
   m=j+1
   do l=m,n
      x(j)=x(j)-x(l)*a(j,l)
   enddo
enddo

end subroutine dLinEqShort

Any thoughts? Thanks again!

Hi sdl,

“pgf90_sect1” is the call to create a 1-D sub-array section. I’ve not actually seen this being called in an OpenACC routine before. Does this occur in the first code you posted or the second? I don’t see anything in the second example that would trigger this, except maybe the creation of the automatic array “q”. Though automatics should be fine, albeit not recommended since they cause every thread to malloc memory, which can be quite slow.

For the second “pgf90_mzero8” problem, that’s most likely the line “x = 0.0d0”. mzero8 is a runtime call to zero out an array. The workaround would be to write this as an explicit loop.
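Something along these lines, using the declarations already in your routine (just a sketch of the workaround, no change in behavior):

! replace the array assignment "x = 0.0d0" with an explicit loop
do i = 1, size(x)
   x(i) = 0.0d0
enddo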

Can you send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? It might help in determining exactly what’s going on.

-Mat

Hi Mat,

Thank you; writing the zeroing of the vector as a loop fixed the other problem and allowed me to try compiling with -fast. And the same error with the call to pgf90_sect1 in an OpenACC region remained.

Apologies, I think I muddled the sequence of errors.
My first post was about an error "Call in OpenACC region to procedure ‘pgf90_copy_f90_argl’ ". Having a hunch that argl had something to do with the compiler’s getting the sizes of the arguments (all assumed-shape 1D, 2D, and 3D arrays), I changed the call such that each variable was passed with its size indicated. That is, I added “(1:nx)”, “(1:nx, 1:ny)” and so forth to each argument. Upon doing that, there was still a similar error at compile time, in exactly the same place, but instead of ‘pgf90_copy_f90_argl’ the run-time procedure at issue became ‘pgf90_sect1’. I’ve since confirmed that one or the other error persists, depending on which of the two forms of the call is used (i.e., with or without argument dimensions specified), with or without optimization.

I’ve sent the code to the customer service request. Thank you very much for your help.

Hi sdl,

Mat is traveling this week and he asked me to look at your code. Oh boy, there’s a lot of stuff in these acc routines. I don’t see much parallel work, but perhaps it is higher or lower in the call chain. The use of all those registers (local variables), and in one case allocate/deallocate, will hamper performance on the GPU.

I think your specific issue is the call to gridCoeff. You don’t have an explicit interface to this subroutine, so the compiler assumes F77 calling conventions. So, it has to create contiguous arrays out of your data structures.
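For reference, an explicit interface block in the caller would look roughly like the following (declarations copied from the routine you posted earlier; the import of dp assumes that kind parameter is accessible in the caller’s scope):

interface
    pure subroutine gridCoeff(xi, yi, zi, fi, xd, yd, zd, xdp, ixrow, iyrow, izrow, pi, fxyz, pf)
        import :: dp
        !$acc routine seq
        real(kind=dp), dimension(:), intent(in) :: xi, yi, zi
        real(kind=dp), dimension(:,:,:), intent(in) :: fi, xdp
        real(kind=dp), dimension(:,:), intent(in) :: xd, yd, zd
        integer, dimension(:), intent(in) :: ixrow, iyrow, izrow
        real(kind=dp), dimension(3), intent(in) :: pi
        real(kind=dp), intent(out) :: fxyz
        real(kind=dp), dimension(3), intent(out) :: pf
    end subroutine gridCoeff
end interface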

Even if you somehow get this to work, it will perform poorly as every thread is doing a ton of staging of arguments.

I think you need to take a step back, look for the parallelism in your code, and concentrate initially on getting that on the GPU. Then widen your scope and work outwards from there.

Hello brentl,
Thanks for the reply. You’ve put your finger on one of my on-going frustrations with this code.

First, somewhat as a shortcut, I’ve been relying on the interfaces that come automatically from module definitions and use statements, but if it will help streamline things (or get them to work at all), I can make interfaces explicit. The original code I inherited was in fact all Fortran 77, with assorted headaches such as liberal use of common blocks and goto statements, and fixed array sizes that very much limited the range of possible use cases and were a bit of a bookkeeping pain in coding, to boot. While it was an immense task to bring all the components somewhat closer to current conventions and standards, it would not surprise me in the least if more effort is needed there.

So, all that said, is there an easy-ish way to tell the compiler to use a calling convention other than F77 that the modules are not already providing? Explicit interface blocks? Or am I misunderstanding, and is the issue bigger than that?

Second, yes, the code I sent is very, very far down the chain. In fact, it’s at the very bottom of the chain. Indeed I have taken the step back to identify the opportunities for parallelization.

You’re right – the subroutines GridCoeff and dLinEq are probably not in themselves parallelizable. Much earlier in the code, which I did not send, there is very natural parallelism: a large number of gangs is defined, each of which calls GridCoeff and dLinEq. GridCoeff and dLinEq comprise the interpolation of the background before the integration step (i.e., the workhorse) for a ray trace program (o.d.e. solver) that traces many, many completely independent rays through complicated media. And I have successfully parallelized this code for multi-core CPUs using OpenMP. Part of the power of the code is that it is so flexible in what environment types it can take on – though what gives it that power is the rather clunky interpolation.

Thus, in order to parallelize with OpenACC, the first step is to get the code to compile and run, however badly, on the GPU. Both portability and the diversity of problems I could take on with this program would be greatly improved if it could somehow utilize a GPGPU. So for the moment, the only issue with those routines is to get them to compile at all on the GPU.

Finally, in direct response to "Even if you somehow get this to work", please note that at this point I’m in the very first Parallelize stage of the 3-step (iterative) cycle that is advocated in the seminars from October 2016 (and which I revisit often, looking for more hints). Of course I expect there will be improvements on each iteration – perhaps even some additional, significant re-writes – but only if I can somehow get it to work a first time … somehow.

To that end, I very much appreciate your help. Again, Thank you!

UPDATE: I’ve just modified GridX, GridCoeff and dLinEq such that allocatable arrays are not used, and dimensions are passed in explicitly as additional integer arguments. Also, I have added explicit interfaces to GridCoeff and GridX for the subroutines each calls. The compiler errors are exactly as they had been, which is not entirely surprising, given that the interfaces had been previously handled through the use of modules, anyway.

Hello again,
I’ve taken a slightly stripped down approach to the problem at the center of this thread, and managed to learn at least a tiny bit. The problem isn’t solved, but my hope is there’s enough light to see a way forward.

Because it wasn’t clear whether the problems I’ve been having were due to the gridCoeff and dLinEqShort subroutines, or maybe something in the data rat’s nest further upstream, I built a new, much simpler program for troubleshooting. This new program builds and populates some data arrays, and then passes those to gridCoeff in a nested loop. I then compiled this program with and without acc turned on. With no acc, it compiles and runs fine. (No surprises there.)

One benefit of the simpler program is that the acc data directives are much more straightforward, so I didn’t have to rely on managed memory at all. While this simpler program compiles with acc, it breaks at run time, and I’ve tracked down exactly where: it seems that the program can write to, but not read from, an automatic array.

In gridCoeff there is a loop that populates a 2D automatic array fx with values based on the scalars and 3D arrays passed in:

do kk=1,nrz
    do jj=1,nry
        fx(jj,kk)=fi(i2,jj,kk)*dx1-xdp(i2,jj,kk)*c1 &
            -fi(i1,jj,kk)*dx2+xdp(i1,jj,kk)*c2
    enddo
enddo

And what I’ve just found is that the first operation that tries to read data out of fx breaks the program at run-time:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I even commented out the rest of the entire subroutine and put in a simple line:

 testDouble = fx(jj,kk)

and the program still broke at run-time.

Note that the arrays I’m working with are not overly large (40 x 30 x 20), and the nested loops are not immense (200 iterations and 10 iterations). All the same, I have added the flag -Mlarge_arrays, and it had no effect on the above.

Brentl previously noted that automatics and allocatables are not optimal, but it seems here they’re kryptonite to the GPU, at least in these very non-parallelizable subroutines, in which an additional chunk of memory totaling about the size of the input data is required for local scalars and arrays. And while the input data are nearly all shared between threads (the scalar 3-vector inputs are independent/private), the additional memory is required for each thread (i.e., necessarily private/independent)!

I’ve downloaded valgrind, as I’ve seen Mat mention it in a number of other threads with error messages similar to those in this update, but I haven’t yet built or run it.

Beyond running the code through valgrind, I’m currently considering 2 courses of action. The first, easier one, is to add a few array dummy arguments to each subroutine in this processing chain to replace all the automatics. These added arrays would have to be private in the acc loop. The second, slightly harder but potentially more robust (insofar as it is possible) option is to overhaul this series of nested loops such that only single-value scalars are required locally.

And of course I’m open to and immensely grateful for any suggestions, feedback or other input! My thin hope is that this update is somehow related to the original problem.

Hi sdl,

To clarify, the automatics are created within a device routine?

In that case, you’re probably blowing the heap. By default, the heap is only 8MB. So while each array is small, every thread will be creating one, and that can cause the heap to fill up rather quickly. To fix this, you can try setting the environment variable “PGI_ACC_CUDA_HEAPSIZE” to the total size in bytes of all your arrays (size of the array * number of threads) plus some extra.

Alternately, you can set the heap size via a CUDA call. See: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
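As a rough sanity check with the sizes you mentioned: if ny=30 and nz=20, the automatics declared in gridCoeff add up to roughly 1300 doubles (about 10 KB) per call, and across 200 x 10 = 2000 loop iterations that is already around 20 MB, comfortably past the 8 MB default. Here is a sketch of raising the limit from CUDA Fortran before the first kernel launch; this assumes you can add a cudafor call to the host code, and the 64 MB value is just a placeholder (size it to bytes-per-thread times the number of threads, plus headroom):

subroutine set_device_heap()
    use cudafor
    implicit none
    integer :: istat
    ! raise the device malloc heap before any OpenACC kernels run
    istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64_8 * 1024 * 1024)
end subroutine set_device_heap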


Yes, we strongly discourage the use of automatics in device code. Besides the small heap, dynamic allocation from the device is quite slow and can severely impact your performance.

-Mat

Hi Mat,
Thank you. This is all starting to make a lot more sense now. Correct, the automatics are in a device routine.

Before I scrap this interpolator from antiquity for something purpose-built for modern architectures, I want to try this next fix …

I changed the subroutines to take assumed-shape arrays as inout arguments and marked those arrays as private in the acc loop. To build up slowly, I did this for only one array first (fxtemp) and commented out any computation that affected any of the others:

        !$acc data copyin(iono, iono%fi, iono%x, iono%y, iono%z, iono%xdp, iono%xd, iono%yd, iono%zd, iono%ix, iono%iy, iono%iz) copyout(posit, vals, derivs) create(ybtemp,ystemp,ydptemp)
        ! 
        !$acc parallel loop collapse(2) private(fxtemp,ii,jj)
        do ii = 1,n1
            do jj = 1,n2
                call gridCoeff(iono%x, iono%y, iono%z, iono%fi, fxtemp, ybtemp, ystemp, ydptemp, &
                    iono%xd, iono%yd, iono%zd, iono%xdp, iono%ix, iono%iy, iono%iz, &
                    posit(1:3,ii,jj), vals(ii,jj), derivs(1:3,ii,jj) )
            enddo
        enddo
        !$acc end parallel loop
        !$acc end data

And that worked great. When I tried to also mark the next array (ybtemp) as private and uncomment those operations in gridCoeff that affect it, I got this error:

-Compiler failed to translate accelerator region (see -Minfo messages): No device symbol for address reference (subs.f90: 109)

If I put ybtemp back in the acc create directive it compiles fine, but then the invalid address error comes back.

There’s probably something simple/obvious about privates that I’m missing, but I haven’t been able to find any other hints in the forum or manuals.

Thank you!

Hello,

I’ve learned a lot recently, but unfortunately the troubles are not done with me.

After revamping the entire subroutine chain to pass the arrays as dummy arguments and remove all allocatables and automatics, the compiler threw the same error (pgf90_copy_f90_argl in an acc region). On the plus side, these changes led to a substantial speed-up in the OpenMP-powered build, as all that array allocation and deallocation was essentially eliminated. So while I’m stuck in exactly the same place, there’s been some improvement elsewhere! Thank you for that.

To further troubleshoot the OpenACC build, I inserted a dummy subroutine where the problematic one goes. This new dummy subroutine ONLY takes in array dummy arguments and performs no operations. Moreover, I stripped it down to the first set of dummy arguments, which are 1D arrays that are part of a defined type:

 call testGrid( iono%x(1:nx), iono%y(1:ny), iono%z(1:nz) )

where

 pure subroutine testGrid (xi,yi,zi )
!$acc routine seq
real(kind=dp), dimension(:), intent(in) :: xi, yi, zi

This threw a similar error to those before:

Call in OpenACC region to procedure ‘pgf90_copy_f77_argl’ which has no acc routine information

That is, UNTIL I added substitute arrays, copied all the values of the original defined-type arrays into them, and passed THOSE into testGrid:

! ... 
real(kind=8) :: xx(size(iono%x)), yy(size(iono%y)), zz(size(iono%z))
do ii = 1,nx
    xx(ii) = iono%x(ii)
enddo 
do jj = 1,ny
    yy(jj) = iono%y(jj)
enddo 
do kk = 1,nz
    zz(kk) = iono%z(kk)
enddo 
call testGrid( xx, yy, zz )

Next I added the next dummy argument in the original gridCoeff list, which is iono%Ne(1:nx,1:ny,1:nz), and sure enough it broke in the same way. But again it worked with a substitute local array into which the data values of iono%Ne were transferred. And so on with all the argument arrays.

So it seems that OpenACC really does not like the current data structure of the arguments. The arrays in the defined type iono are pointer arrays that point to arrays passed in through the c_types interface.
What do I need to tell OpenACC about those data to get past this problem? Changing the data structure of all the arguments as above would be very costly memory-wise (especially if it’s in the form of automatics). I’m okay with another substantial modification to some swath of the code, but at this point I’m, once again, at a loss as to what that might be. Thank you for any suggestions you can offer!

-sdl (Stephen)

In case you’re curious what’s in this code:
First, this code, which is a 3D tricubic spline interpolator, is very challenging to parallelize. I found a master’s thesis on exactly this topic from only a few years ago, and a grip of papers on alternate methods. Because it seems so far that (1) those methods may not have the numerical “clean-ness” of the algorithm at hand, (2) the rest of the code is very sensitive to numerical noise, especially in the higher-order derivatives, and (3) I don’t want to build brand new code anyway, I’m still pushing forward with this implementation.