function/procedure calls not supported

Hi Mat,
I recently compiled a code whose structure is as follows:
SUBROUTINE XYZ(…)
.
.
.
.
.
.
!$ACC REGION
!$ACC DO PRIVATE(N)
DO 110 N = 1,NUMEL
.
.
.
.
.
.
.
CALL ELMLIB(…)
.
.
110 CONTINUE
!$ACC END REGION
.
.
RETURN
END

I got the following error with reference to the call to the subroutine ELMLIB inside the structured block:
----Accelerator restriction: function/procedure calls not supported----

This, I presume, is with reference to the restriction that a program may not branch in or out of an accelerator region. At the same time, I cannot do away with the call to the subroutine inside the structured block. Could you suggest a workaround?

Thanks
Saumik.

Hi Saumik,

All subroutines need to be inlined before they can be used within an accelerator region. For details about automatic inlining by the compiler, please refer to Chapter 4 of the PGI User’s Guide (https://www.pgroup.com/doc/pgiug.pdf). Basically, if the callee is located in the same file as the caller, you only need to add the flag “-Minline”. Otherwise, you need to either create an extract library (-Mextract) or use IPA inlining (-Mipa=inline). However, not every routine can be automatically inlined, so on occasion you may need to manually inline the routine.
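
As a rough illustration of what manual inlining means, here is a minimal sketch. It assumes the routine being called is a small function whose body is just x*y; the function name dosomething and the program name manual_inline are placeholders for illustration only, not taken from your code. The call inside the loop is replaced by hand with the body of the function, so nothing is left for the compiler to inline:

```fortran
program manual_inline
  implicit none
  integer :: i
  real :: a(100), b(100)

  ! Fill the input array on the host.
  do i = 1, 100
     a(i) = real(i) * 100.0
  end do

!$acc region
  do i = 1, 100
     ! Originally (hypothetical): b(i) = dosomething(a(i), a(i))
     ! The function body, x*y, is pasted in by hand instead of being called.
     b(i) = a(i) * a(i)
  end do
!$acc end region

  print *, b(99), b(1)
end program manual_inline
```

For a large routine such as your ELMLIB this is obviously more work, so it is usually worth trying the automatic inlining flags first.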

Hope this helps,
Mat

Hi Mat,
Could you give an example of a code where a loop is being parallelized and there are a couple of function calls within the loop and the corresponding (function inlining) compiler flag(s) to be appended? This would help me get started.

Thanks
Saumik.

Hi Saumik,

Here’s a very trivial example, but it should show you how to use the inlining flags.

Case 1: When the callee and caller are in the same file.

% cat test.f90
        function dosomething (x,y)
          real :: x,y
          real :: dosomething
          dosomething = x*y
        end function

        program foo
	real a(100), b(100)
!$acc region
	do i = 1,100
	 a(i) = float(i) * 100
	enddo

	do i = 1,100
	 b(i) = dosomething(a(i),a(i))
	enddo
!$acc end region
        print *, b(99), b(1)
	end
% pgf90 -ta=nvidia -Minfo=accel,inline test.f90 
PGF90-W-0155-Accelerator region ignored; see -Minfo messages  (test.f90: 9)
foo:
      9, Accelerator region ignored
     14, Accelerator restriction: function/procedure calls are not supported
     15, Accelerator restriction: unsupported call to 'dosomething'
  0 inform,   1 warnings,   0 severes, 0 fatal for foo
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline test.f90 
foo:
      9, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     14, Loop is parallelizable
         Accelerator kernel generated
         14, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     15, dosomething inlined, size=2, file test.f90 (1)

Case 2: The callee and caller are in separate files. Both files are compiled on the same command line.

% cat do_mod.f90
    module do_mod
     contains
        function dosomething (x,y)
          real :: x,y
          real :: dosomething
          dosomething = x*y
        end function
    end module do_mod

% cat test1.f90

        program foo
        use do_mod
	real a(100), b(100)
!$acc region
	do i = 1,100
	 a(i) = float(i) * 100
	enddo

	do i = 1,100
	 b(i) = dosomething(a(i),a(i))
	enddo
!$acc end region
        print *, b(99), b(1)
	end
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline do_mod.f90 test1.f90
do_mod.f90:
test1.f90:
do_mod.f90:
test1.f90:
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2, file do_mod.f90 (3)

Case 3: The callee and caller are in separate files. Separate compilation using IPA.

% pgf90 -c -Mipa=inline do_mod.f90 
% pgf90 -Mipa=inline -ta=nvidia -Minfo do_mod.o test1.f90
test1.f90:
PGF90-W-0155-Accelerator region ignored; see -Minfo messages  (test1.f90: 5)
foo:
      5, Accelerator region ignored
     10, Accelerator restriction: function/procedure calls are not supported
     11, Accelerator restriction: unsupported call to 'dosomething'
  0 inform,   1 warnings,   0 severes, 0 fatal for foo
IPA: no IPA optimizations for 1 source files
IPA: Recompiling test1.o: stale object file
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2 (IPA) file do_mod.f90 (3)

Case 4: The callee and caller are in separate files. Create an extract library.

% pgf90 -Mextract=lib:extlib do_mod.f90
% pgf90 -Mextract=lib:extlib test1.f90
% pgf90 -c do_mod.f90
% pgf90 -Minline=lib:extlib -ta=nvidia -Minfo -c test1.f90
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2, file do_mod.f90 (3)
% pgf90 -ta=nvidia -Minfo test1.o do_mod.o

During development, I find extract libraries to be the easiest method. Yes, it requires the extra extract step, but this only needs to be done once (unless the file changes). I can then simply recompile the source file with accelerator directives without needing to go through the link step with IPA just to see if my changes work.

Hope this helps,
Mat

Hi Mat,
The problem I am facing is that the subroutine definitions contain alternate return statements/data statements/format statements/assigned goto statements making them not inlinable. The fact remains that it is virtually impossible to tinker with these statements without destroying the existing structure of the code. Is there a workaround?

Thanks
Saumik.

Hi Saumik,

> The problem I am facing is that the subroutine definitions contain alternate return statements/data statements/format statements/assigned goto statements making them not inlinable

Unfortunately, these constructs also make them unsuitable for a GPU.

> Is there a workaround?

The other option is to push the acc directives into the subroutine and then use a data region and reflected directives to pass device data. However, if the subroutine does not contain enough parallelism, you may not see much gain in performance.

> The fact remains that it is virtually impossible to tinker with these statements without destroying the existing structure of the code.

While ideally porting to a GPU would require no changes to existing code, in your case it does seem that some changes may be necessary.
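
To make the data region and reflected option more concrete, here is a minimal sketch. All names in it (elm_mod, square_elems, the arrays a and b) are hypothetical, and the directive spellings follow my understanding of the PGI Accelerator model, so please check them against the documentation for your compiler version. The caller creates the device copies in a data region, and the subroutine declares its dummy arguments as reflected so that its own accelerator region reuses those copies. The subroutine sits in a module so that the caller sees an explicit interface.

```fortran
module elm_mod
contains
  subroutine square_elems(a, b)
    implicit none
    real :: a(:), b(:)
    integer :: i
    ! Assumption: the caller has already placed a and b on the device.
!$acc reflected(a, b)
!$acc region
    do i = 1, size(a)
       b(i) = a(i) * a(i)
    end do
!$acc end region
  end subroutine square_elems
end module elm_mod

program foo
  use elm_mod
  implicit none
  integer, parameter :: n = 100
  real :: a(n), b(n)
  integer :: i

  ! Fill the input array on the host.
  do i = 1, n
     a(i) = real(i) * 100.0
  end do

  ! The data region owns the device copies; the call inside reuses them.
!$acc data region copyin(a) copyout(b)
  call square_elems(a, b)
!$acc end data region

  print *, b(99), b(1)
end program foo
```

Because the directives are ordinary comments to a compiler that does not recognize them, the same source still builds and runs on the host, which makes it easy to compare results.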

- Mat