Routine has too high performance cost

vsingh96824 · September 15, 2016, 4:19pm

I have a subroutine that goes something like this:

subroutine top

subroutine fun

  do i = 1, N
    do j = 1, P

      ! some work

      call A()
      call B()

      ! some work

    end do
  end do

end subroutine fun

I put OpenACC directives around the outer loops and put acc routine seq inside A and B.
Most of the work is on scalars, so it should be fast.
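
For illustration, here is a minimal sketch of the structure described above. The original post does not say which directive was placed around the loops, so the parallel loop collapse(2) clause, the argument lists, and the bodies of A and B are all assumptions made only to get something that compiles. Only the names fun, A, B and the use of acc routine seq come from the post.

```fortran
module kernels
  implicit none
contains

  ! Hypothetical stand-in for A: scalar work only, callable from device code.
  subroutine A(x, y)
    !$acc routine seq
    real, intent(in)  :: x
    real, intent(out) :: y
    y = 2.0 * x
  end subroutine A

  ! Hypothetical stand-in for B.
  subroutine B(y, z)
    !$acc routine seq
    real, intent(in)    :: y
    real, intent(inout) :: z
    z = z + y
  end subroutine B

  subroutine fun(n, p, out)
    integer, intent(in) :: n, p
    real, intent(out)   :: out(p, n)
    integer :: i, j
    real :: x, y, z

    ! Assumed directive: the post only says "OpenACC directives around the outer loops".
    !$acc parallel loop collapse(2) private(x, y, z) copyout(out)
    do i = 1, n
      do j = 1, p
        x = real(i + j)    ! placeholder for "some work"
        call A(x, y)
        z = 1.0
        call B(y, z)
        out(j, i) = z      ! placeholder for "some work"
      end do
    end do
  end subroutine fun

end module kernels

program demo
  use kernels
  implicit none
  real :: out(3, 2)
  call fun(2, 3, out)
  print *, out(1, 1)
end program demo
```

Without an OpenACC-enabled compiler the directives are treated as comments, so the same source also builds and runs serially on the host.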

But compared to other subroutines this one was pretty slow: each call takes about 27 ms.

So I tried to see what was taking time. If I comment out call A and call B it takes only 3 ms which is acceptable.

So I un-commented call A. It took 16 ms. But here’s where it gets weird: I commented out everything in A, so it does nothing, and it still took 14 ms.

I decided to investigate further. I removed all the variables being passed to A and then gradually added them back. Note that all the actual work is still commented out; I am only adding back variable declarations inside A. It seems that every 2 or 3 variables add about 1 ms. This suggests the only efficient way to use routine would be to pass no variables at all, which is not useful.

Is there a better way to do this?

MatColgrove · September 15, 2016, 5:30pm

Hi Vsingh,

There is overhead when making calls. The best way to work around this cost is to avoid the call and inline the routine instead. This can be done either manually or via the compiler flag “-Minline” when the routines are in the same file. When the routines are in separate files, you either need to put both files on the same compilation line or use IPA inlining, “-Mipa=inline”. Please see Chapter 5 of the PGI User’s Guide, http://www.pgroup.com/doc/pgiug-x64.pdf, for full details.

Note that inlining occurs before the OpenACC directives are applied, so you’re actually inlining on the host and then creating the device code.
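
As a rough sketch of what manual inlining looks like, the bodies of the called routines are pasted into the loop in place of the call statements, so the device code contains no calls at all. The loop bounds, the directive, and the one-line bodies standing in for A and B below are hypothetical, since the original routines were not posted.

```fortran
program manual_inline
  implicit none
  integer, parameter :: n = 2, p = 3
  real :: out(p, n)
  integer :: i, j
  real :: x, y, z

  ! Assumed directive; no "acc routine" is needed because nothing is called.
  !$acc parallel loop collapse(2) private(x, y, z) copyout(out)
  do i = 1, n
    do j = 1, p
      x = real(i + j)    ! placeholder for "some work"
      y = 2.0 * x        ! hypothetical body of A, in place of "call A"
      z = 1.0
      z = z + y          ! hypothetical body of B, in place of "call B"
      out(j, i) = z      ! placeholder for "some work"
    end do
  end do

  print *, out(1, 1)
end program manual_inline
```

The trade-off is duplicated code wherever the routine is used, which is why letting the compiler inline is usually the first thing to try.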

Mat

vsingh96824 · September 16, 2016, 4:58pm

Hi Mat,

Thanks for the reply.

I expected some overhead, but it is almost an order of magnitude higher than I expected, which I think is too much.

The subroutines are in different files, but when I switch on -Mipa=inline lots of other subroutines get inlined and I lose the granular view I have of the performance. Is it possible to inline only that subroutine?

MatColgrove · September 17, 2016, 9:48pm

Is it possible to inline only that subroutine?

Yes, though you’ll want to use “-Minline=name:”. To use -Minline across multiple files, either add the files on the same compilation line, or first perform an extract pass (i.e. compile all files with -Mextract=lib:inllib) and then recompile with “-Minline=lib:inllib -Minline=name:”.

Mat
