Problem with using Fortran derived types within OpenACC kernel

Hello,
In the following code, printing bst_nr%b(1)%minf from the device fails, while bst_nr%n is printed correctly. What am I doing wrong?
Thanks,
Ilkhom

[ilkhom@t020 MCCC-FN-GPU_DEV]$ module load nvhpc/22.3
[ilkhom@t020 MCCC-FN-GPU_DEV]$ cat test.f90 
module bst_data
  implicit none
  type:: nr
     integer :: minf 
  end type nr 
  type:: basis 
    integer:: n
    type(nr), allocatable, dimension(:) :: b
  end type basis 
  type(basis) :: bst_nr   ! this is Sturmian basis  (depend on k and l)
!$acc declare create(bst_nr)
end module bst_data

subroutine acc_rout
!$acc routine seq
  use bst_data
  implicit none

  print*,'bst_nr%n=', bst_nr%n
  print*,'bst_nr%b(1)%minf=', bst_nr%b(1)%minf

end subroutine acc_rout

program test
  use bst_data
  implicit none
  integer :: i

  bst_nr%n=5
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
  enddo

!$acc enter data copyin(bst_nr%n)
!$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
!$acc enter data copyin(bst_nr%b(i)%minf)
  enddo
!$acc update device(bst_nr)

!$acc kernels
!$acc loop
do i=1,1
  call acc_rout
enddo
!$acc end kernels

end program test
[ilkhom@t020 MCCC-FN-GPU_DEV]$ nvfortran -Minfo=accel -acc -ta=tesla:cuda11.6 test.f90 && ./a.out
acc_rout:
     14, Generating acc routine seq
         Generating NVIDIA GPU code
test:
     35, Generating enter data copyin(bst_nr%n)
     36, Generating enter data copyin(bst_nr%b(:))
     38, Generating enter data copyin(bst_nr%b$p%minf)
     40, Generating update device(bst_nr)
     44, Loop is parallelizable
         Generating NVIDIA GPU code
         44, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
 bst_nr%n=            5
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

The update directive is the problem. It performs a shallow copy of the user-defined type (UDT), so you’re overwriting the device address of “b” with the host address.

!$acc enter data copyin(bst_nr%n)

This should be an update directive instead. “copyin” creates and copies the data only if it’s not already present. Since “bst_nr” is already present on the device (from the “declare create” in the module), no action is taken and “n” remains uninitialized.

Here’s the corrected code:

% cat test.F90
module bst_data
  implicit none
  type:: nr
     integer :: minf
  end type nr
  type:: basis
    integer:: n
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr   ! this is Sturmian basis  (depend on k and l)
!$acc declare create(bst_nr)
end module bst_data

subroutine acc_rout
!$acc routine seq
  use bst_data
  implicit none

  print*,'bst_nr%n=', bst_nr%n
  print*,'bst_nr%b(1)%minf=', bst_nr%b(1)%minf

end subroutine acc_rout

program test
  use bst_data
  implicit none
  integer :: i

  bst_nr%n=5
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
  enddo
!$acc update device(bst_nr%n)
!$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
!$acc enter data copyin(bst_nr%b(i)%minf)
  enddo

!$acc kernels
!$acc loop
do i=1,1
  call acc_rout
enddo
!$acc end kernels

end program test
% nvfortran test.F90 -acc ; a.out
 bst_nr%n=            5
 bst_nr%b(1)%minf=            2
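
The shallow-copy point also matters if the host values change later. As a minimal sketch (not part of the tested code above), the member array can be refreshed on its own instead of updating the parent “bst_nr”. This assumes the same module as above, where each element of “b” holds only a plain integer, so updating “bst_nr%b” moves no pointers. The program name, the variable “total” and the value 3 are made up for illustration.

```fortran
module bst_data
  implicit none
  type :: nr
    integer :: minf
  end type nr
  type :: basis
    integer :: n
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr
!$acc declare create(bst_nr)
end module bst_data

program update_members   ! hypothetical name, for illustration only
  use bst_data
  implicit none
  integer :: i, total

  bst_nr%n = 5
  allocate(bst_nr%b(1:bst_nr%n))
  do i = 1, bst_nr%n
    bst_nr%b(i)%minf = 2
  enddo
  ! Initial transfer, following the corrected code above
!$acc update device(bst_nr%n)
!$acc enter data copyin(bst_nr%b)

  ! The host values change later on
  do i = 1, bst_nr%n
    bst_nr%b(i)%minf = 3
  enddo
  ! Refresh only the member array. Updating the parent bst_nr here
  ! would shallow-copy the host address of b over the device address.
!$acc update device(bst_nr%b)

  ! Read the values on the device and reduce them to one number,
  ! so that nothing has to be printed from device code
  total = 0
!$acc parallel loop reduction(+:total)
  do i = 1, bst_nr%n
    total = total + bst_nr%b(i)%minf
  enddo
  print *, 'sum of minf =', total

!$acc exit data delete(bst_nr%b)
  deallocate(bst_nr%b)
end program update_members
```

If the elements of “b” had allocatable members of their own, updating “bst_nr%b” as a whole would hit the same shallow-copy problem one level down, so the innermost arrays would need to be updated individually.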

Hope this helps,
Mat

Thanks, Mat, for the prompt reply (as always). Much appreciated! I also noticed that just printing bst_nr%b(i)%f(1:nr) from within the acc routine seq subroutine is very slow. Is this expected behaviour with derived type structures? In my case i~100 and nr~8000, and f(:) is a double precision array.

I’m a bit surprised that putting an array in a print statement worked. There’s only basic support for printing from the device, and I didn’t think the runtime call for printing arrays was supported. I just tried it and see the expected error: “Compiler runtime function not supported - pghpfio_ldw64”.

In general, codes should limit printing from the device. All the output needs to be buffered and copied back to the host to be printed. So yes, I’d expect printing 800,000 doubles from the device to be slow.
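
One way to keep printing off the device, sketched here under assumptions rather than taken from the original code, is to copy each “f” array back with an update directive and print it from the host. The program name, the parameter “npts”, the small sizes and the fill formula are made up for illustration (the sizes in question are about 100 and 8000).

```fortran
module bst_data
  implicit none
  type :: nr
    integer :: minf
    real(8), allocatable, dimension(:) :: f
  end type nr
  type :: basis
    integer :: n
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr
!$acc declare create(bst_nr)
end module bst_data

program print_on_host    ! hypothetical name, for illustration only
  use bst_data
  implicit none
  integer, parameter :: npts = 4   ! assumed small size for the sketch
  integer :: i, j

  bst_nr%n = 3                     ! assumed small size for the sketch
  allocate(bst_nr%b(1:bst_nr%n))
  do i = 1, bst_nr%n
    bst_nr%b(i)%minf = 2
    allocate(bst_nr%b(i)%f(1:npts))
    bst_nr%b(i)%f = 0.0d0
  enddo

  ! Top-down transfer: scalar member, then b, then each f
!$acc update device(bst_nr%n)
!$acc enter data copyin(bst_nr%b)
  do i = 1, bst_nr%n
!$acc enter data copyin(bst_nr%b(i)%f(:npts))
  enddo

  ! Fill f on the device (made-up values, just to have data to print)
!$acc parallel loop gang
  do i = 1, bst_nr%n
!$acc loop vector
    do j = 1, npts
      bst_nr%b(i)%f(j) = real(i, 8) + 0.1d0 * real(j, 8)
    enddo
  enddo

  ! Copy each f back to the host, then print from host code
  do i = 1, bst_nr%n
!$acc update self(bst_nr%b(i)%f)
    print *, 'bst_nr%b(', i, ')%f =', bst_nr%b(i)%f
  enddo
end program print_on_host
```

Only the innermost arrays are updated here, so no device pointers are overwritten, and the host print statement can handle whole arrays.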

Another issue is that the access to “f” is non-contiguous across the threads (each thread is accessing a separate array). This leads to a lot of stalls waiting for memory and to cache thrashing. If you put the print in an explicit loop and then make the routine “vector”, you can get the threads to access the array contiguously. Something like:

module bst_data
  implicit none
  type:: nr
     integer :: minf
     real(8), allocatable, dimension(:) :: f
  end type nr
  type:: basis
    integer:: n
    integer :: nr
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr   ! this is Sturmian basis  (depend on k and l)
!$acc declare create(bst_nr)
end module bst_data

module routines
contains
subroutine acc_rout(i)
!$acc routine vector
  use bst_data
  implicit none
  integer,value :: i
  integer :: nrs,j
  nrs= bst_nr%nr
!$acc loop vector
  do j=1,nrs
    print*,'bst_nr%b(i)%f(j)=', bst_nr%b(i)%f(j)
  enddo

end subroutine acc_rout
end module routines
program test
  use bst_data
  use routines
  implicit none
  integer :: i

  bst_nr%n=100
  bst_nr%nr = 8000
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
    allocate(bst_nr%b(i)%f(1:bst_nr%nr))
    bst_nr%b(i)%f = 1.0
  enddo
!$acc update device(bst_nr%n)
!$acc update device(bst_nr%nr)
!$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
!$acc enter data copyin(bst_nr%b(i)%f(:bst_nr%nr))
  enddo

!$acc kernels
!$acc loop gang
do i=1,bst_nr%n
  call acc_rout(i)
enddo
!$acc end kernels

end program test

I printed it using an explicit loop.

Thank you very much for advising on an optimal way to work with derived-type arrays. With the changes you suggested I got a roughly 200-fold speedup in the main code.
