module foo
contains
pure function bar()
end function bar
end module foo
How can I tell nvfortran to inline bar? I’ve tried -Minline=bar and -Minline=foo_bar, but it doesn’t seem like the inliner is being used. bar is trivial, so it shouldn’t be failing inlining criteria.
Assuming the call crosses source files, and until we get IPA up and running again (it’s in progress), you’ll need to do a two-pass compilation. In the first pass, add “-Mextract=lib:” to extract the inlining information; in the second pass, add “-Minline=lib:” to use this information to inline.
While this does make the inline information available to the compiler, it’s not a guarantee that it will inline. Other factors, such as the call depth, the routine size, and whether any array arguments need to be reshaped, can limit inlining. You may need to adjust other settings via the -Minline sub-options (see “nvfortran -help -Minline” for the complete list).
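As a rough sketch of the two-pass build described above: assume the callee lives in foo.f90, the caller in main.f90, and the inline library is called inl.il. All three names are hypothetical, since the thread does not give them. My understanding is that the extract pass only gathers inline information and does not produce object files, so the sources are compiled again in the second pass.

```shell
# Sketch only: foo.f90, main.f90 and inl.il are hypothetical names.
FC=${FC:-nvfortran}
INLINE_LIB=${INLINE_LIB:-inl.il}

two_pass_build() {
    # Pass 1: extract inline information for the callee into the library.
    "$FC" -Mextract=lib:"$INLINE_LIB" foo.f90 || return 1

    # Pass 2: compile each file separately, pointing the inliner at the library.
    # foo.f90 goes first so that its module file exists when main.f90 is compiled.
    "$FC" -c -Minline=lib:"$INLINE_LIB" foo.f90 || return 1
    "$FC" -c -Minline=lib:"$INLINE_LIB" main.f90 || return 1

    # Link the objects as usual.
    "$FC" foo.o main.o -o main
}
```

Calling two_pass_build runs the steps in order. In a parallel make setup, the extract step would need to be a prerequisite of every object that wants to inline from the library.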
We’re running into a similar issue at the moment and were wondering if it is possible to do a single pass for extracting and inlining?
We are using a parallel make environment, which is probably why the -Mipa=inline is not working.
But it would be great to have confirmation whether there is no other option as well.
Do you mean the two pass -Mextract/-Minline? -Mipa has been disabled for a few years now due to complications of integrating it into the LLVM back-end. The flag is still there for makefile compatibility but should be giving you a warning that it’s been deprecated.
if it is possible to do a single pass for extracting and inlining?
If all the source files are on the same compile command, then single pass can be used. But if each object is compiled separately, you need to use the two-pass method.
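For comparison, a minimal sketch of the single-command case, assuming the same hypothetical file names (foo.f90 for the callee, main.f90 for the caller). With every source file on one command line, the compiler can run the extract step itself, so no inline library is named.

```shell
# Sketch only: foo.f90 and main.f90 are hypothetical names.
FC=${FC:-nvfortran}

single_command_build() {
    # All sources on one command line: extraction and inlining happen
    # within this single invocation.
    "$FC" -Minline foo.f90 main.f90 -o main
}
```

This only helps when the build can hand all the relevant sources to one compiler invocation, which a typical per-object make rule does not do.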
To put the question more generally: is there any other way to force inlining of subroutines in nvfortran besides the two-pass routine with -Mextract/-Minline? Anything like a forceinline pragma or some other option?
I’m asking because inlining of subroutines seems to be mandatory (or highly desirable) in GPU code, so I’m wondering whether the compiler design implies a straightforward solution, or a choice of solutions, for this rather general problem.
No, there’s no way to force inlining. You can give hints such as the “inline” keyword in C/C++ or “-Minline=”. Though for all the various ways inlining is performed, in all cases, the definition of the routine to be inlined needs to be visible when compiling the routine in which it is to be inlined.
In other words, “-Mextract” is the way to gather the information needed about routines not within the same compiling unit so the compiler can attempt to inline. It doesn’t force inlining.
The size of a routine often affects whether it can be inlined. The option “-Minline=maxsize:” can be used to increase the allowable size of an individual routine, and “-Minline=totalsize:” the total size when inlining multiple levels of routines.
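A hedged sketch of how the size limits might be passed: the numeric values and file names below are placeholders, not recommendations, and I am assuming the sub-options can be combined in one comma-separated -Minline list. -Minfo=inline is not mentioned in this thread; it is added here so the compiler reports which calls it actually inlined.

```shell
# Sketch only: the limits and file names are hypothetical placeholders.
FC=${FC:-nvfortran}
MAXSIZE=${MAXSIZE:-500}        # per-routine size limit (placeholder value)
TOTALSIZE=${TOTALSIZE:-5000}   # total size limit across inlined levels (placeholder value)

build_with_limits() {
    # Raise the inliner's size limits and ask for an inlining report.
    "$FC" -Minline=maxsize:"$MAXSIZE",totalsize:"$TOTALSIZE" -Minfo=inline \
        foo.f90 main.f90 -o main
}
```

If the report does not list the routine, that suggests one of the other criteria (call depth, array reshaping) is still blocking it.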
Thanks for explanations.
OK, I’m thinking of a different solution that will in fact force inlining of some target procedures. Since I’m interested only in the specific case of generating GPU code using OpenMP offloading directives, this combination seems to work well:
File aaa.f90:
MODULE AAA
use iso_fortran_env
IMPLICIT NONE
PUBLIC BBB
INTERFACE BBB
MODULE PROCEDURE CCC
END INTERFACE
CONTAINS
FUNCTION CCC( pa, pb )
!$omp declare target
REAL(real64) :: pa,pb ! input
REAL(real64) :: CCC ! result
!!-----------------------------------------------------------------------
IF ( pb >= 0.e0) THEN ; CCC = ABS(pa)
ELSE ; CCC =-ABS(pa)
ENDIF
END FUNCTION CCC
END MODULE AAA
File testinl.f90:
module testinl
use omp_lib
use iso_fortran_env
use AAA
implicit none
contains
subroutine testinl()
real (real64) :: X
real (real64) :: Y
integer :: j, i
!$omp target teams distribute collapse ( 2 ) &
!$omp map ( to: Y ) map ( from: X )
do j = 1, 2000
do i = 1, 1000
X = Y * BBB(1.0_real64, 1.0_real64)
end do
end do
!$omp end target teams distribute
end subroutine testinl
end module testinl
program Test
use omp_lib
use iso_fortran_env
use testinl
implicit none
call testinl()
end program Test
The aaa.f90 file contains the function to inline. We use !$omp declare target in its body. If we remove this directive, the function is not inlined and a linker error occurs for the device code.
Would you agree that this seems to be a (special-case) solution for inlining device functions? Can you recommend anything else within this context?
The reason the link is failing is that there’s no device routine for “ccc”. “declare target” is needed so the compiler knows it needs to create a device-callable version of the routine, but it does not implicitly inline it. To inline, the definition of the callee must be visible when compiling the caller, which is not the case here.
To illustrate, I removed the “declare target” from “ccc”, so when attempting to compile we get a link error because there’s no device version of “ccc”:
% nvfortran -fast -mp=gpu aaa.f90 testinl.f90
aaa.f90:
testinl.f90:
nvlink error : Undefined reference to 'aaa_ccc_' in '/tmp/nvfortranfxqOkpr0jKDpZ.o'
pgacclnk: child process exit status 2: /proj/nv/Linux_x86_64/23.9/compilers/bin/tools/nvdd
While it’s best practice to use “declare target”, we can instead inline the routine using the “-Minline” flag:
Notice that the file names are listed twice. This is because the compiler is doing two passes: first it extracts the information from the source that’s needed for inlining, then it compiles the object using this information. While it’s too much to post, if you use the verbose (-v) flag, you can see the different phases in detail. “fort2ex” is the extract utility, “fort1” is the front-end compiler, and “fort2” is the back-end compiler.
When the source files are compiled separately, the compiler can’t implicitly perform this extraction. Instead, the user must add an extract step to their build, storing the results of the first pass in an inline library.
LTO does make this easier (and I hope our engineers will be able to support it again in the future), but it’s also doing two passes. It gathers the inline information during the first compilation and then, at link time, recompiles all the source using this information.