Parallelizing with a fortran routine

Sorella · December 9, 2019, 11:22am

Dear support, I am having trouble with the following error:

Error: /tmp/pgaccnbreNboO1eAs.gpu (3256, 14): parse ‘@makefun_’ defined with type ‘void (i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*)*’
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (…/upnewwf_new.f90: 355)
PGF90/power Linux 19.10-0: compilation aborted

makefun (in makefun.f90) is a routine declared seq, and it is called inside a loop of the following type:

!$acc parallel loop private(kpip,cphs)
do i=1,nshell
do ii=1,kgrid(i+ishift)%dimshell
indpar=indpar_tab(i)
indorb=indorb_tab(i)
indshell=indshell_tab(i)
kpip(1) = kgrid(i+ishift)%kpip(1,ii)
kpip(2) = kgrid(i+ishift)%kpip(2,ii)
kpip(3) = kgrid(i+ishift)%kpip(3,ii)
! update el-ion distances and modulus

…
call makefun( several parameters)

…
!$acc end parallel

I am confused because a similar simpler loop was compiled without problems:

do I=1,nshell

call makefun

enddo

and was also working.
Now there are two extra complications: 1) the routine is inside another routine,
after the "contains" statement; 2) there is a structure indexing the loop. However,
the second does not seem to be the problem, because if I comment out the call to makefun,
the code compiles with PGI version 19.10.

Can you help me?

…

MatColgrove · December 9, 2019, 3:33pm

Hi Sorella,

the routine is inside another routine after the "contains" statement

This might be a problem as well since contained subroutines are passed a hidden stack pointer to the parent’s local variables. If this is from the host, the compiler would need to pass in a host stack pointer, which would fail on the device.

Though, it doesn’t fully explain why it works in the second case, unless the routine is getting inlined in the second case but not the first.
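To illustrate what "the parent's local variables" means here: a contained (internal) procedure can read and write the enclosing routine's variables without receiving them as arguments, so the compiler has to give it some way to reach the parent's frame. A small host-only sketch, with hypothetical names that are not from the original code:

```fortran
program host_association_demo
  implicit none
  call parent(3)
end program host_association_demo

subroutine parent(n)
  implicit none
  integer, intent(in) :: n
  integer :: total
  total = 0
  ! No arguments are passed: add_n reaches n and total through the parent.
  call add_n()
  print *, total
contains
  subroutine add_n()
    ! n and total are the parent's variables (host association).
    total = total + n
  end subroutine add_n
end subroutine parent
```

If add_n were marked as a device routine while parent runs on the host, the device code would need that same access to host memory, which is the situation described above.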

Unfortunately, I can’t really tell what’s going on from this bit of code. Can you provide a complete reproducing example so I can investigate?

Thanks,
Mat

Sorella · December 12, 2019, 5:08am

I have simplified the code and given three examples. The first one is not working, and the error message is clear now:

PGF90-S-0155-acc routine cannot be used for contained subprograms that refer to host subprogram data: n (prova.f90)
0 inform, 0 warnings, 1 severes, 0 fatal for norm_comp

Thus it seems it cannot be done at present.
I include three examples of the SAME algorithm (a naive implementation of a matrix-vector product): the first is written in f90 with contains, and the second passes all variables explicitly as arguments instead of using contains, as in f77. If I cannot use OpenACC with f90, that is a problem.
The two codes differ only in language style, and they should be equivalent for a reasonable compiler/paradigm (pgf90/OpenACC).
If this is not possible, I cannot port my application code to the GPU in a reasonable time, because going back to f77 is a real pain.

Notice also that if I use contains in the main program, the program also works (third case). So, in my opinion, it would be worthwhile for PGI/OpenACC to allow the first code to work.

Many thanks for any help (perhaps there is a workaround?).

This is the sample case that cannot be compiled:
program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif
write(6,*) 'Input N leading dimension square matrix A'
read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))
do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec(n,a,b,c)
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop
write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
end program prova
subroutine matvec(n,a,b,c)
implicit none
integer i,n
real*8 csum,a(n,n),b(n),c(n)
!$acc routine(norm_comp) seq
!$acc parallel loop
do i=1,N
call norm_comp
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
contains
subroutine norm_comp
implicit none
real*8 csum
integer j
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp
end subroutine matvec

In contrast, the following compiles, because all the arguments of the subroutine are passed explicitly, as in a standard f77 case:

program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif
write(6,*) 'Input N leading dimension square matrix A'
read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))
do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec(n,a,b,c)
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop

write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
end program prova
subroutine matvec(n,a,b,c)
implicit none
integer i,n
real*8 csum,a(n,n),b(n),c(n)
!$acc routine(norm_comp) vector
!$acc parallel loop
do i=1,N
call norm_comp(i,n,a,b)
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
end subroutine matvec
subroutine norm_comp(i,n,a,b)
implicit none
integer i,j,n
real*8 a(n,n),b(n)
real*8 csum
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp

The third case, in which the main program includes, after contains, a subroutine that is supposed to
work on the accelerator:

program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif

write(6,*) 'Input N leading dimension square matrix A'

read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))

do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop

write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
contains
subroutine matvec
implicit none
integer i
!$acc routine(norm_comp) vector
!$acc parallel loop
do i=1,N
call norm_comp(n,i,csum,a,b)
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
end subroutine matvec
end program prova
subroutine norm_comp(n,i,csum,a,b)
implicit none
integer, intent(in):: N
real*8, intent(out):: csum
real*8, intent(in) :: b(N)
real*8, intent(in) :: A(N,N)
integer i,j
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp

MatColgrove · December 12, 2019, 8:21pm

Hi Sorella,

Notice also that if I use contains in the main program, the program also works (third case). So, in my opinion, it would be worthwhile for PGI/OpenACC to allow the first code to work.

To reiterate, the problem here is that Fortran defines that a hidden argument holding the parent’s stack pointer is passed to a contained subroutine. Hence, when the device routine is contained in a host routine (as is the case in test #1), the host stack address needs to be passed in. If/when the device can access the host’s stack, then we may be able to support this, but not until then.

The second test case (passing in the arguments) would be the workaround.
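A minimal sketch of that workaround follows. It is not the original code: the module name matvec_mod and the small driver program are hypothetical, and unlike test #2 as posted, the result csum is returned through an explicit argument so that c(i) receives the computed value. Nothing in norm_comp refers to a parent routine's data, so no host stack pointer is needed.

```fortran
! Hypothetical module, for illustration only.
module matvec_mod
  implicit none
contains
  ! Device routine: everything it needs arrives as an explicit argument.
  subroutine norm_comp(n, i, a, b, csum)
    integer, intent(in) :: n, i
    real(8), intent(in) :: a(n,n), b(n)
    real(8), intent(out) :: csum
    integer :: j
    !$acc routine seq
    csum = 0.d0
    do j = 1, n
      csum = csum + a(i,j)*b(j)
    enddo
  end subroutine norm_comp

  subroutine matvec(n, a, b, c)
    integer, intent(in) :: n
    real(8), intent(in) :: a(n,n), b(n)
    real(8), intent(out) :: c(n)
    real(8) :: csum
    integer :: i
    ! csum is private so each iteration has its own copy.
    !$acc parallel loop copyin(a,b) copyout(c) private(csum)
    do i = 1, n
      call norm_comp(n, i, a, b, csum)
      c(i) = csum
    enddo
  end subroutine matvec
end module matvec_mod

program demo
  use matvec_mod, only: matvec
  implicit none
  integer, parameter :: n = 3   ! assumed size, for illustration
  real(8) :: a(n,n), b(n), c(n)
  a = 1.d0
  b = 2.d0
  call matvec(n, a, b, c)
  ! Each row sums 1*2 over 3 columns, so every element of c is 6.
  print *, c
end program demo
```

Without OpenACC enabled the directives are treated as comments, so the same source also builds and runs on the host.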

For the third case, this program does work since both the parent and contained routines are on the host so the device doesn’t need direct access to the host stack. However, I would suggest managing the data movement. Here are the modifications I made to your code:

% diff -u test3.org.f90 test3.f90
--- test3.org.f90       2019-12-12 12:07:10.293654000 -0800
+++ test3.f90   2019-12-12 12:08:50.977676000 -0800
@@ -40,6 +40,9 @@
 a(i,i)=1.d0
 b(i)=cos(dble(i)**2-3*i+1) ! random init
 enddo
+
+!$acc data copyin(a) copy(b) create(c)
+
 call matvec
 csum=0.d0
 !$acc parallel loop reduction(+:csum)
@@ -54,6 +57,8 @@
 enddo
 !$acc end parallel loop

+!$acc end data
+
 write(6,*) ' Final b --> A b '
 do i=1,N
 write(6,*) i,b(i)
@@ -65,7 +70,7 @@
 implicit none
 integer i
 !$acc routine(norm_comp) vector
-!$acc parallel loop
+!$acc parallel loop present(a,b,c) private(csum)
 do i=1,N
 call norm_comp(n,i,csum,a,b)
 ! csum=0.d0
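
Stripped down to a self-contained program, the pattern in this diff looks roughly like the sketch below. The fixed size n, the identity matrix, and the temporary tmp are assumptions for illustration, and the row sum is written inline instead of calling norm_comp.

```fortran
program data_region_sketch
  implicit none
  integer, parameter :: n = 4      ! assumed size, for illustration only
  real(8) :: a(n,n), b(n), c(n), csum, tmp
  integer :: i, j

  a = 0.d0
  do i = 1, n
    a(i,i) = 1.d0                  ! identity matrix, so c equals b
  enddo
  b = 1.d0

  ! a is only read, b is read and written back, c lives only on the device.
  !$acc data copyin(a) copy(b) create(c)

  !$acc parallel loop present(a,b,c) private(tmp)
  do i = 1, n
    tmp = 0.d0
    do j = 1, n
      tmp = tmp + a(i,j)*b(j)
    enddo
    c(i) = tmp
  enddo

  csum = 0.d0
  !$acc parallel loop present(c) reduction(+:csum)
  do i = 1, n
    csum = csum + c(i)**2
  enddo
  csum = sqrt(csum)

  !$acc parallel loop present(b,c)
  do i = 1, n
    b(i) = c(i)/csum
  enddo

  !$acc end data

  print *, b
end program data_region_sketch
```

The arrays stay on the device across all three loops, and only b is copied back when the data region ends.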

Hope this helps,
Mat

Sorella · December 13, 2019, 5:09am

The program prova.f90 does not use the stack, and generally speaking it is not difficult to write code that does not use the stack for passing variables.
In my opinion, prova.f90 would compile correctly if pgf90 did not stop for this reason. If there is an option to disable this stop for stack consistency, I think I will be able to compile my complex code.

Also referring to your comment:

"For the third case, this program does work since both the parent and contained routines are on the host so the device doesn’t need direct access to the host stack. However, I would suggest managing the data movement."

I suppose the routine norm_comp is defined on the device, since I have put the explicit directive
right after the declarations (line 88):
!$acc routine seq
