Routine has too high performance cost

I have a subroutine that goes something like this:

subroutine top

subroutine fun

  do i = 1, N
    do j = 1, P

       some work

       call A()
       call B()

       some work

    end do
  end do

end subroutine fun

I put OpenACC directives around the outer loops and added acc routine seq inside A and B.
Most of the work is on a few scalars, so it should be fast.

But compared to other subroutines, this one is pretty slow. Each call of the subroutine takes about 27 ms.

So I tried to see what was taking the time. If I comment out call A and call B, it takes only 3 ms, which is acceptable.

So I un-commented call A, and it took 16 ms. But here’s where it gets weird: I then commented out everything in A, so it’s not doing anything, and it still took 14 ms.

I decided to investigate further. I removed all the variables being passed to A and then gradually added them back. Note that all the actual work is still commented out; I am only adding back the variable declarations inside A. It seems every 2 or 3 variables add about 1 ms. This suggests the only efficient way to use routine would be to pass no variables at all, which is not useful.

Is there a better way to do this?

Hi Vsingh,

There is overhead when making calls. The best way to work around this cost is to not make the call and instead inline the routine. This can be done either manually or via the compiler flag “-Minline” when the routines are in the same file. When the routines are in separate files, you need to either put both files on the same compilation line or use IPA inlining, “-Mipa=inline”. Please see Chapter 5 of the PGI User’s Guide for full details.

Note that inlining occurs before the OpenACC directives are applied, so you’re actually inlining on the host and then creating the device code.
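
To illustrate the manual option, here is a minimal sketch, not the original poster's code. The body of A, the array shape, the names fun_call and fun_inlined, and the arithmetic are all assumptions made up for this example. It places a version that calls an acc routine seq subroutine next to a version where the same work is written directly in the loop body.

```fortran
! Hypothetical sketch: the body of A, the sizes, and the names fun_call and
! fun_inlined are assumptions for illustration, not code from this thread.
module work_mod
  implicit none
contains

  ! Called version: A is compiled as a sequential device routine.
  subroutine A(x, y)
    !$acc routine seq
    real, intent(in)    :: x
    real, intent(inout) :: y
    y = y + 2.0 * x
  end subroutine A

  ! Loop nest that calls A for every (j, i) element.
  subroutine fun_call(n, p, out)
    integer, intent(in) :: n, p
    real, intent(inout) :: out(p, n)
    integer :: i, j
    !$acc parallel loop collapse(2) copy(out)
    do i = 1, n
      do j = 1, p
        call A(real(i), out(j, i))
      end do
    end do
  end subroutine fun_call

  ! Manually inlined version: the body of A is written in place of the call.
  subroutine fun_inlined(n, p, out)
    integer, intent(in) :: n, p
    real, intent(inout) :: out(p, n)
    integer :: i, j
    !$acc parallel loop collapse(2) copy(out)
    do i = 1, n
      do j = 1, p
        out(j, i) = out(j, i) + 2.0 * real(i)
      end do
    end do
  end subroutine fun_inlined

end module work_mod

program demo
  use work_mod
  implicit none
  integer, parameter :: n = 4, p = 3
  real :: r1(p, n), r2(p, n)

  r1 = 1.0
  r2 = 1.0
  call fun_call(n, p, r1)
  call fun_inlined(n, p, r2)

  ! Both versions perform the same arithmetic, so the results should agree.
  print *, 'max difference:', maxval(abs(r1 - r2))
end program demo
```

Without OpenACC enabled, the directives are treated as comments and the program runs on the host, which makes it easy to check that the two versions agree before timing them on the device. Compiling with “-Minline” asks the compiler to attempt the same transformation on fun_call automatically.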

  • Mat

Hi Mat,

Thanks for the reply.

I expected there to be some overhead, but the amount is almost an order of magnitude higher than I expected, which I think is too much.

The subroutines are in different files, but when I switch on -Mipa=inline, lots of other subroutines get inlined as well and I lose the granular view I have of the performance. Is it possible to make only that subroutine inline?

Is it possible to make only that subroutine inline?

Yes, though you’ll want to use “-Minline=name:” followed by the routine’s name. To use -Minline across multiple files, either add the files on the same compilation line, or first perform an extract pass (i.e. compile all files with -Mextract=lib:inllib) and then recompile with “-Minline=lib:inllib -Minline=name:”, again followed by the routine’s name.

  • Mat