Hi Saumik,
Here’s a trivial example, but it should show you how to use the inlining flags.
Case 1: The callee and caller are in the same file.
% cat test.f90
function dosomething (x,y)
  real :: x,y
  real :: dosomething
  dosomething = x*y
end function
program foo
  real a(100), b(100)
!$acc region
  do i = 1,100
    a(i) = float(i) * 100
  enddo
  do i = 1,100
    b(i) = dosomething(a(i),a(i))
  enddo
!$acc end region
  print *, b(99), b(1)
end
% pgf90 -ta=nvidia -Minfo=accel,inline test.f90
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (test.f90: 9)
foo:
9, Accelerator region ignored
14, Accelerator restriction: function/procedure calls are not supported
15, Accelerator restriction: unsupported call to 'dosomething'
0 inform, 1 warnings, 0 severes, 0 fatal for foo
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline test.f90
foo:
9, Generating copyout(b(:))
Generating copyout(a(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
10, Loop is parallelizable
Accelerator kernel generated
10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
14, Loop is parallelizable
Accelerator kernel generated
14, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
15, dosomething inlined, size=2, file test.f90 (1)
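If you want to tell these two outcomes apart from a script rather than by eye, something like the following might do. It is only a sketch: `check_accel` is a hypothetical helper name, and the strings it looks for are simply copied from the -Minfo messages above, so adjust them if your compiler version words things differently.

```shell
# Hypothetical helper: read compiler messages on stdin and report
# whether the accelerator region was kept or dropped.
# The matched strings are copied from the -Minfo output shown above.
check_accel() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'Accelerator region ignored'; then
        echo "region ignored"
        return 1
    fi
    if printf '%s\n' "$log" | grep -q 'Accelerator kernel generated'; then
        echo "kernel generated"
        return 0
    fi
    echo "no accelerator messages"
    return 2
}
```

It could be used as `pgf90 -ta=nvidia -Minfo=accel,inline -Minline test.f90 2>&1 | check_accel`, where the `2>&1` makes sure the messages reach the pipe whichever stream they are written to.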
Case 2: The callee and caller are in separate files, compiled together on the same command line.
% cat do_mod.f90
module do_mod
contains
  function dosomething (x,y)
    real :: x,y
    real :: dosomething
    dosomething = x*y
  end function
end module do_mod
% cat test1.f90
program foo
  use do_mod
  real a(100), b(100)
!$acc region
  do i = 1,100
    a(i) = float(i) * 100
  enddo
  do i = 1,100
    b(i) = dosomething(a(i),a(i))
  enddo
!$acc end region
  print *, b(99), b(1)
end
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline do_mod.f90 test1.f90
do_mod.f90:
test1.f90:
do_mod.f90:
test1.f90:
foo:
5, Generating copyout(b(:))
Generating copyout(a(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
6, Loop is parallelizable
Accelerator kernel generated
6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
10, Loop is parallelizable
Accelerator kernel generated
10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
11, dosomething inlined, size=2, file do_mod.f90 (3)
Case 3: The callee and caller are in separate files. Separate compilation using IPA.
% pgf90 -c -Mipa=inline do_mod.f90
% pgf90 -Mipa=inline -ta=nvidia -Minfo do_mod.o test1.f90
test1.f90:
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (test1.f90: 5)
foo:
5, Accelerator region ignored
10, Accelerator restriction: function/procedure calls are not supported
11, Accelerator restriction: unsupported call to 'dosomething'
0 inform, 1 warnings, 0 severes, 0 fatal for foo
IPA: no IPA optimizations for 1 source files
IPA: Recompiling test1.o: stale object file
foo:
5, Generating copyout(b(:))
Generating copyout(a(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
6, Loop is parallelizable
Accelerator kernel generated
6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
10, Loop is parallelizable
Accelerator kernel generated
10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
11, dosomething inlined, size=2 (IPA) file do_mod.f90 (3)
Case 4: The callee and caller are in separate files. Create an extract library.
% pgf90 -Mextract=lib:extlib do_mod.f90
% pgf90 -Mextract=lib:extlib test1.f90
% pgf90 -c do_mod.f90
% pgf90 -Minline=lib:extlib -ta=nvidia -Minfo -c test1.f90
foo:
5, Generating copyout(b(:))
Generating copyout(a(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
6, Loop is parallelizable
Accelerator kernel generated
6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
10, Loop is parallelizable
Accelerator kernel generated
10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
11, dosomething inlined, size=2, file do_mod.f90 (3)
% pgf90 -ta=nvidia -Minfo test1.o do_mod.o
During development, I find extract libraries to be the easiest method. Yes, they require the extra extract step, but it only needs to be done once (unless the extracted source file changes). I can then simply recompile the source file containing the accelerator directives, without going through the IPA link step just to see if my changes work.
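As a rough sketch, the Case 4 workflow could be wrapped in a small script like the one below. The compiler commands are the ones from Case 4; the function names and the `PGF90` and `DRY_RUN` variables are made up for this illustration and are not part of the PGI tools.

```shell
#!/bin/sh
# Sketch of the extract-library workflow from Case 4.
# do_mod.f90, test1.f90 and extlib are the names used in the example above.
# PGF90 and DRY_RUN are hypothetical knobs added for this sketch.
PGF90=${PGF90:-pgf90}

# Print each command, then run it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# One-time step: build the extract library and the callee object.
# Repeat only when the extracted sources change.
extract_once() {
    run "$PGF90" -Mextract=lib:extlib do_mod.f90 &&
    run "$PGF90" -Mextract=lib:extlib test1.f90 &&
    run "$PGF90" -c do_mod.f90
}

# Inner loop: recompile only the file with the accelerator directives.
# The -Minfo messages appear here, before any link step.
recompile_acc() {
    run "$PGF90" -Minline=lib:extlib -ta=nvidia -Minfo -c test1.f90
}

# Final link, once the accelerator region compiles the way you want.
link_all() {
    run "$PGF90" -ta=nvidia -Minfo test1.o do_mod.o
}
```

After sourcing the file, you would call `extract_once` a single time, then `recompile_acc` after each edit to test1.f90, and `link_all` at the end. Setting `DRY_RUN=1` just prints the commands without running them.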
Hope this helps,
Mat