Parallelizing with a Fortran routine

Dear support, I am running into the following error:

Error: /tmp/pgaccnbreNboO1eAs.gpu (3256, 14): parse ‘@makefun_’ defined with type ‘void (i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*)*’
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (…/upnewwf_new.f90: 355)
PGF90/power Linux 19.10-0: compilation aborted

makefun (in makefun.f90) is a routine declared with acc routine seq. It is called inside a loop of the following type:

!$acc parallel loop private(kpip,cphs)
do i=1,nshell
do ii=1,kgrid(i+ishift)%dimshell
indpar=indpar_tab(i)
indorb=indorb_tab(i)
indshell=indshell_tab(i)
kpip(1) = kgrid(i+ishift)%kpip(1,ii)
kpip(2) = kgrid(i+ishift)%kpip(2,ii)
kpip(3) = kgrid(i+ishift)%kpip(3,ii)
! update el-ion distances and modulus
call makefun( several parameters)
enddo
enddo
!$acc end parallel loop

I am confused because a similar simpler loop was compiled without problems:

do I=1,nshell

call makefun

enddo

and it also ran correctly.
Now there are two extra complications: 1) the routine is inside another routine, after the "contains" statement; 2) a structure is used to index the loop. The second does not seem to be the problem, because if I comment out the call to makefun, the PGI compiler (version 19.10) compiles the code.

Can you help me?




Hi Sorella,

  1. the routine is inside another routine after the "contains" statement

This might be a problem as well since contained subroutines are passed a hidden stack pointer to the parent’s local variables. If this is from the host, the compiler would need to pass in a host stack pointer, which would fail on the device.

Though, it doesn’t fully explain why it works in the second case, unless the routine is getting inlined in the second case but not the first.
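To make the host-association point concrete, here is a minimal CPU-only sketch. It only illustrates the language-level behaviour (a contained routine reading and writing its parent's variables without receiving them as arguments). How a given compiler implements this access is as described above and is not visible in the code. The names host_assoc_demo, parent and child are hypothetical and do not come from the thread.

```fortran
module host_assoc_demo
  implicit none
contains
  subroutine parent(n, total)
    integer, intent(in)  :: n
    integer, intent(out) :: total
    integer :: i
    total = 0
    do i = 1, n
      call child()   ! no arguments: i and total are reached through the parent
    end do
  contains
    subroutine child()
      ! i and total are not declared here. They are the parent's variables,
      ! made visible by host association.
      total = total + i
    end subroutine child
  end subroutine parent
end module host_assoc_demo

program demo
  use host_assoc_demo
  implicit none
  integer :: total
  call parent(4, total)
  print *, total   ! 1+2+3+4 = 10
end program demo
```

If child were compiled as a device routine while parent runs on the host, child would need some way to reach i and total in the parent's frame, which is the situation the reply describes.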

Unfortunately, I can’t really tell what’s going on from this bit of code. Can you provide a complete reproducing example so I can investigate?

Thanks,
Mat

I have simplified the code and given three examples. The first one is not working, and the error message is clear now:

PGF90-S-0155-acc routine cannot be used for contained subprograms that refer to host subprogram data: n (prova.f90)
0 inform, 0 warnings, 1 severes, 0 fatal for norm_comp

So it seems this cannot be done at present.
I include three examples of the SAME algorithm (a poor implementation of a matrix-vector product): the first is written in f90 with contains, and the second passes all variables as explicit arguments, as in f77, instead of using contains. Not being able to use OpenACC with f90 would be a problem.
The two codes differ only in language style, and they should be equivalent for a reasonable compiler/paradigm (pgf90/OpenACC).
If this is not possible, I cannot port my application code to the GPU in a reasonable time, because going back to f77 is a real pain.

Note also that if I use contains in the main program, the program works (third case). So, in my opinion, it would be worthwhile for PGI/OpenACC to make the first code work as well.

Many thanks for any help (perhaps there is a workaround?).

This is the sample case that cannot be compiled:
program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif
write(6,*) 'Input N leading dimension square matrix A'
read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))
do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec(n,a,b,c)
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop
write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
end program prova
subroutine matvec(n,a,b,c)
implicit none
integer i,n
real*8 csum,a(n,n),b(n),c(n)
!$acc routine(norm_comp) seq
!$acc parallel loop
do i=1,N
call norm_comp
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
contains
subroutine norm_comp
implicit none
real*8 csum
integer j
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp
end subroutine matvec


By contrast, the following works, passing all the arguments of the subroutine explicitly as in a standard f77 code:

program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif
write(6,*) 'Input N leading dimension square matrix A'
read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))
do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec(n,a,b,c)
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop

write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
end program prova
subroutine matvec(n,a,b,c)
implicit none
integer i,n
real*8 csum,a(n,n),b(n),c(n)
!$acc routine(norm_comp) vector
!$acc parallel loop
do i=1,N
call norm_comp(i,n,a,b)
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
end subroutine matvec
subroutine norm_comp(i,n,a,b)
implicit none
integer i,j,n
real*8 a(n,n),b(n)
real*8 csum
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp

The third case, in which the main program uses contains to hold a subroutine that is meant to run on the accelerator:

program prova
#ifdef _OPENACC
use openacc
#endif
implicit none
integer n,i,j
real*8, dimension(:,:), allocatable:: a
real*8, dimension(:), allocatable:: b,c
real*8 csum
#ifdef _OPENACC
integer mygpu, myrealgpu, num_devices, my_device_type
!$acc routine(norm_comp) seq
my_device_type = acc_device_nvidia
mygpu = 0
call acc_set_device_type(my_device_type)
num_devices = acc_get_num_devices(my_device_type)
write(6,*) ' Number of devices available: ',num_devices
call acc_set_device_num(mygpu,my_device_type)
write(6,*) 'Trying to use GPU:',mygpu
myrealgpu = acc_get_device_num(my_device_type)
write(6,*) 'Actually I am using GPU: ',myrealgpu
if(mygpu.ne.myrealgpu) then
write(6,*) 'I cannot use the requested GPU:',mygpu
stop
endif
#endif

write(6,*) 'Input N leading dimension square matrix A'

read(*,*) N
allocate(b(N),c(N))
allocate(a(N,N))



do i=1,N
do j=1,N
a(i,j)=dsin(dble(i-j)/N)
enddo
a(i,i)=1.d0
b(i)=cos(dble(i)**2-3*i+1) ! random init
enddo
call matvec
csum=0.d0
!$acc parallel loop reduction(+:csum)
do i=1,N
csum=csum+c(i)**2
enddo
!$acc end parallel loop
csum=sqrt(csum)
!$acc parallel loop
do i=1,N
b(i)=c(i)/csum
enddo
!$acc end parallel loop

write(6,*) ' Final b --> A b '
do i=1,N
write(6,*) i,b(i)
enddo

stop
contains
subroutine matvec
implicit none
integer i
!$acc routine(norm_comp) vector
!$acc parallel loop
do i=1,N
call norm_comp(n,i,csum,a,b)
! csum=0.d0
!!$acc loop reduction(+:csum)
! do j=1,N
! csum=csum+A(i,j)*b(j)
! enddo
c(i)=csum
enddo
!$acc end parallel loop
end subroutine matvec
end program prova
subroutine norm_comp(n,i,csum,a,b)
implicit none
integer, intent(in):: N
real*8, intent(out):: csum
real*8, intent(in) :: b(N)
real*8, intent(in) :: A(N,N)
integer i,j
!$acc routine seq
csum=0.d0
do j=1,N
csum=csum+A(i,j)*b(j)
enddo
end subroutine norm_comp

Hi Sorella,

Note also that if I use contains in the main program, the program works (third case). So, in my opinion, it would be worthwhile for PGI/OpenACC to make the first code work as well.

To reiterate, the problem here is that Fortran defines that a hidden argument pointing to the parent's stack is passed to a contained subroutine. Hence when the device routine is contained in a host routine (as is the case in test #1), the host stack address needs to be passed in. If/when the device can access the host's stack, then we may be able to support this, but not until then.

The second test case (passing in the arguments) would be the workaround.
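A minimal sketch of that explicit-argument pattern follows. It is an illustration under assumptions, not the code from the thread: the module placement, the names matvec_mod and row_dot, and the choice to return csum through an intent(out) argument are all hypothetical. Built without OpenACC, the directives are plain comments and the program runs on the CPU.

```fortran
module matvec_mod
  implicit none
contains
  ! Device routine: everything it needs arrives as an explicit argument,
  ! so it does not depend on a parent's local variables.
  subroutine row_dot(i, n, a, b, csum)
    integer, intent(in) :: i, n
    double precision, intent(in)  :: a(n,n), b(n)
    double precision, intent(out) :: csum
    integer :: j
    !$acc routine seq
    csum = 0.d0
    do j = 1, n
      csum = csum + a(i,j)*b(j)
    end do
  end subroutine row_dot

  subroutine matvec(n, a, b, c)
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), b(n)
    double precision, intent(out) :: c(n)
    double precision :: csum
    integer :: i
    ! csum is private so that each iteration has its own copy
    !$acc parallel loop copyin(a, b) copyout(c) private(csum)
    do i = 1, n
      call row_dot(i, n, a, b, csum)
      c(i) = csum
    end do
  end subroutine matvec
end module matvec_mod

program demo
  use matvec_mod
  implicit none
  integer, parameter :: n = 3
  double precision :: a(n,n), b(n), c(n)
  a = 1.d0
  b = [1.d0, 2.d0, 3.d0]
  call matvec(n, a, b, c)
  print *, c   ! every row of a is all ones, so each c(i) is sum(b) = 6
end program demo
```

Keeping the routine in a module preserves f90-style explicit interfaces, which may soften the objection that the workaround means going back to f77.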

For the third case, this program does work since both the parent and contained routines are on the host, so the device doesn't need direct access to the host stack. However, I would suggest managing the data movement. Here are the modifications I made to your code:

% diff -u test3.org.f90 test3.f90
--- test3.org.f90       2019-12-12 12:07:10.293654000 -0800
+++ test3.f90   2019-12-12 12:08:50.977676000 -0800
@@ -40,6 +40,9 @@
 a(i,i)=1.d0
 b(i)=cos(dble(i)**2-3*i+1) ! random init
 enddo
+
+!$acc data copyin(a) copy(b) create(c)
+
 call matvec
 csum=0.d0
 !$acc parallel loop reduction(+:csum)
@@ -54,6 +57,8 @@
 enddo
 !$acc end parallel loop

+!$acc end data
+
 write(6,*) ' Final b --> A b '
 do i=1,N
 write(6,*) i,b(i)
@@ -65,7 +70,7 @@
 implicit none
 integer i
 !$acc routine(norm_comp) vector
-!$acc parallel loop
+!$acc parallel loop present(a,b,c) private(csum)
 do i=1,N
 call norm_comp(n,i,csum,a,b)
 ! csum=0.d0
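The effect of the data region in the diff above can be sketched in isolation. This is a reduced, hypothetical program (data_region_demo is not a name from the thread) that keeps only the normalization step: the arrays are moved once for the whole region and the two kernels inside it find them already present on the device. Without OpenACC the directives are comments and the program runs on the CPU.

```fortran
program data_region_demo
  implicit none
  integer, parameter :: n = 4
  double precision :: b(n), c(n), csum
  integer :: i

  do i = 1, n
    c(i) = dble(i)
  end do

  ! c is copied to the device once on entry; b is copied back once on exit.
  !$acc data copyin(c) copyout(b)
  csum = 0.d0
  !$acc parallel loop present(c) reduction(+:csum)
  do i = 1, n
    csum = csum + c(i)**2
  end do
  csum = sqrt(csum)   ! done on the host with the reduced value
  !$acc parallel loop present(b, c)
  do i = 1, n
    b(i) = c(i)/csum
  end do
  !$acc end data

  print *, sum(b**2)   ! b is normalized, so this should be 1 up to rounding
end program data_region_demo
```

Without the enclosing data region, each parallel loop would have to move its arrays on its own, which is the extra traffic the suggested change avoids.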

Hope this helps,
Mat

The code in prova.f90 does not use the stack, and generally speaking it is not difficult to write code that does not pass variables through the stack.
In my opinion prova.f90 would compile correctly if pgf90 did not stop for this reason. If there is an option to disable this stack-consistency check, I think I will be able to compile my complex code.

Also referring to your comment:

"For the third case, this program does work since both the parent and contained routines are on the host so the device doesn't need direct access to the host stack. However, I would suggest managing the data movement."

I suppose the routine norm_comp is defined on the device, since I have put the explicit directive right after the declarations (line 88):
!$acc routine seq